Ev3Dev committed on
Commit
db03c40
·
verified ·
1 Parent(s): 5c3cfae

Upload folder using huggingface_hub

H100_JUPYTER_SETUP.md ADDED
@@ -0,0 +1,199 @@
# H100 Jupyter Notebook Setup

This guide walks you through setting up the OpenEnv Bio Experiment environment on an **NVIDIA H100** Jupyter notebook instance (e.g., JupyterLab on Lambda Labs, RunPod, or similar).

## Prerequisites

- **Python** ≥ 3.10 (3.10, 3.11, or 3.12 recommended)
- **uv** – fast Python package manager ([install instructions](#installing-uv))
- **NVIDIA driver** ≥ 535.104.05 (usually pre-installed on H100 instances)
- **CUDA** – H100 uses CUDA 12.x; PyTorch wheels bundle the runtime, so a separate CUDA Toolkit is not required

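To confirm the driver prerequisite from a notebook cell, you can parse `nvidia-smi` output. This is a small stdlib-only sketch (not part of the project); it returns `None` when no NVIDIA driver is present:

```python
import re
import subprocess

def nvidia_driver_version():
    """Return the NVIDIA driver version string, or None if unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None  # no NVIDIA driver, or nvidia-smi not on PATH
    match = re.search(r"Driver Version:\s*([\d.]+)", out)
    return match.group(1) if match else None

print(nvidia_driver_version())
```

On an H100 instance this should print something like `535.104.05` or newer.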
## Installing uv

If `uv` is not already installed:

```bash
# Unix/Linux (including Jupyter notebook terminals)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or with pip
pip install uv
```

Verify:

```bash
uv --version
```

## Quick Setup (Recommended)

### 1. Clone and enter the project

```bash
git clone <repository-url> OpenENV-Hackathon
cd OpenENV-Hackathon
```

### 2. Use uv's auto PyTorch backend

uv can detect your GPU and pick the right PyTorch build. For H100 (CUDA 12.x):

```bash
# Install everything: core + training (TRL, transformers, torch, unsloth) + Jupyter
UV_TORCH_BACKEND=cu128 uv sync --extra train

# Add Jupyter kernel support
uv add ipykernel jupyter --extra train
```

If `UV_TORCH_BACKEND=cu128` fails (e.g., cu128 wheels not available yet), try:

```bash
UV_TORCH_BACKEND=cu126 uv sync --extra train
```

### 3. Register the environment as a Jupyter kernel

```bash
uv run python -m ipykernel install --user --name openenv-bio --display-name "OpenEnv Bio (H100)"
```

### 4. Verify CUDA

In a new Jupyter notebook, select the **"OpenEnv Bio (H100)"** kernel and run:

```python
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}")
```

Expected output (or similar):

```
PyTorch: 2.x.x+cu128
CUDA available: True
GPU: NVIDIA H100 ...
```

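As an extra sanity check, the H100 reports CUDA compute capability 9.0 (`sm_90`). A defensive sketch (not from the project) that degrades gracefully when PyTorch or a GPU is missing:

```python
def gpu_compute_capability():
    """Return (major, minor) compute capability of GPU 0, or None."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch not installed in this kernel
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)

cc = gpu_compute_capability()
print(cc)  # expect (9, 0) on an H100
```

If this prints a capability below `(9, 0)`, the kernel is not actually running on an H100.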
### 5. Sanity check the environment

```bash
uv run pytest tests/test_environment.py tests/test_literature_benchmark.py -q
```

## Manual PyTorch CUDA Configuration

If you need explicit control over the PyTorch index (e.g., for reproducibility), add the following to `pyproject.toml`:

### Add to `pyproject.toml`

```toml
# After [tool.uv], add:

[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true

[tool.uv.sources]
torch = [{ index = "pytorch-cu128" }]
torchvision = [{ index = "pytorch-cu128" }]
```

Then run:

```bash
uv sync --extra train
uv add ipykernel jupyter --extra train
```

For CUDA 12.6 instead of 12.8, use `cu126` in the index URL and source names.

## Dependency Groups

| uv sync flag | Contents |
|-----------------------------|--------------------------------------------------------------------------------|
| *(default)* | Core: `openenv-core`, `numpy`, `scipy`, `pydantic` |
| `--extra dev` | Testing: `pytest`, `pytest-cov` |
| `--extra train` | Training: `torch`, `transformers`, `trl`, `accelerate`, `peft`, `unsloth`, etc. |
| `--extra bio` | Bioinformatics: `scanpy`, `biopython`, `gseapy` |
| `--extra train --extra dev` | Combined for development + training |

## Preferred H100 Workflow

On H100, use the quantized Unsloth entrypoints:

```bash
uv run python training_unsloth.py --dry-run
uv run python training_unsloth.py --model-id Qwen/Qwen3.5-4B --output-dir training/grpo-unsloth-output
uv run python run_agent_unsloth.py
```

The checked-in `inference.ipynb` notebook now uses `training_unsloth.py` helpers with 4-bit loading and fast inference enabled by default.

## Running Training in a Jupyter Notebook

Example cell:

```python
# In a notebook with the OpenEnv Bio (H100) kernel
!uv run python training_unsloth.py --model-id Qwen/Qwen3.5-4B --dry-run
```

Or run interactively from Python:

```python
import subprocess
subprocess.run([
    "uv", "run", "python", "training_unsloth.py",
    "--model-id", "Qwen/Qwen3.5-4B",
    "--output-dir", "training/grpo-unsloth-output",
], check=True)
```

157
+
158
+ ## Requirements Summary
159
+
160
+ | Component | Version / Notes |
161
+ |----------------|------------------------------------------------------|
162
+ | Python | ≥ 3.10 |
163
+ | uv | ≥ 0.5.3 (for PyTorch index support) |
164
+ | torch | ≥ 2.10.0 (cu128 or cu126 for H100) |
165
+ | transformers | ≥ 5.3.0 |
166
+ | trl | ≥ 0.29.0 |
167
+ | accelerate | ≥ 1.13.0 |
168
+ | Jupyter | Optional, for notebook workflows |
169
+
## Troubleshooting

### `torch.cuda.is_available()` is False

- Confirm the Jupyter kernel is the one where you ran `uv sync` (the one with `ipykernel`).
- Ensure no CPU-only PyTorch is overriding the CUDA build (e.g., from a different conda/pip env).
- Run `uv run python -c "import torch; print(torch.__file__)"` to verify PyTorch comes from your project venv.

### Flash Attention / causal-conv fallback warnings

These are common and usually harmless; execution continues with a slower path. For best H100 performance, ensure `transformers` and `torch` are recent versions that support Flash Attention 2.

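To see which scaled-dot-product-attention backends PyTorch has enabled in your kernel, you can query the `torch.backends.cuda` flags (available in PyTorch ≥ 2.0). A guarded sketch, not from the project, that also runs when PyTorch is absent:

```python
def sdp_backends():
    """Report which scaled-dot-product-attention backends are enabled."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch not installed in this kernel
    return {
        "flash": torch.backends.cuda.flash_sdp_enabled(),
        "mem_efficient": torch.backends.cuda.mem_efficient_sdp_enabled(),
        "math": torch.backends.cuda.math_sdp_enabled(),
    }

print(sdp_backends())
```

A `False` flash entry explains the fallback warnings; the math backend is the slow but always-available path.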
### HuggingFace symlink warnings

Set:

```bash
export HF_HUB_DISABLE_SYMLINKS_WARNING=1
```

### Out-of-memory during training

- Reduce `--num-generations` or `--rollout-steps`.
- Use a smaller model (e.g., `Qwen/Qwen3.5-0.8B`) for experiments.
- Keep `--disable-4bit` off unless you explicitly need full-precision weights.

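A rough rule of thumb behind the 4-bit recommendation: weight memory scales linearly with parameter count and bit width. An illustrative stdlib sketch (weights only; real usage adds activations, optimizer state, and KV cache on top):

```python
def model_weight_memory_gb(n_params, bits=4):
    """Approximate GPU memory (GiB) for model weights alone."""
    return n_params * bits / 8 / 1024**3

# A 4B-parameter model: ~7.5 GiB in 16-bit vs ~1.9 GiB in 4-bit
for bits in (16, 4):
    print(bits, round(model_weight_memory_gb(4e9, bits), 1))
```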
## See Also

- Main [README.md](README.md) for project overview, APIs, and usage
- [uv PyTorch guide](https://docs.astral.sh/uv/guides/integration/pytorch/) for advanced PyTorch configuration
README.md CHANGED
@@ -249,6 +249,8 @@ An episode ends when one of the following happens:
249
 
250
  Dependencies are managed with `uv`. The package requires Python ≥ 3.10.
251
 
252
  ```bash
253
  # Core environment only
254
  uv sync
@@ -342,6 +344,13 @@ The environment class supports concurrent sessions, but the bundled server is cu
342
  uv run python run_agent.py
343
  ```
344
 
345
  Configuration is via environment variables:
346
 
347
  | Variable | Default | Description |
@@ -367,6 +376,14 @@ uv run python training_script.py --dry-run
367
  uv run python training_script.py --model-id Qwen/Qwen3.5-0.8B
368
  ```
369
 
370
  Key arguments:
371
 
372
  | Argument | Default | Description |
@@ -381,13 +398,15 @@ Key arguments:
381
  | `--scenario-name` | all | Repeatable; restricts which scenarios are used |
382
  | `--domain-randomise` | off | Enable domain randomisation |
383
  | `--num-generations` | `4` | GRPO generations per prompt |
384
- | `--max-completion-length` | `220` | Max tokens for model completions |
385
  | `--max-prompt-length` | `768` | Max tokens for prompts |
386
  | `--learning-rate` | `5e-6` | AdamW learning rate |
387
  | `--dry-run` | off | Build data and test reward without training |
388
 
389
  By default the reward function reconstructs prompt states locally so the prompt and reward stay aligned. Switch to a live server-backed reward loop with `--reward-backend remote --base-url http://localhost:8000`.
390
 
391
  After training, the script saves plots to the output directory:
392
 
393
  - `training_loss.png`
@@ -413,7 +432,9 @@ This runs N episodes with a `random` or `heuristic` policy, saves JSON trajector
413
  - `training/literature_benchmark.py` runs paper-aligned action sequences and compares outcomes against curated expected findings
414
  - `training/rollout_collection.py` collects direct environment rollouts into trajectory files
415
  - `training_script.py` trains a GRPO policy with OpenEnv reward calls
 
416
  - `run_agent.py` runs a local language model planner against the environment
 
417
  - `training/trajectory.py` stores trajectories for offline RL, imitation learning, replay, and evaluation
418
  - `training/evaluation.py` computes online, benchmark, expert-review, and fidelity-oriented metrics
419
 
@@ -488,6 +509,7 @@ That makes it suitable for:
488
  ├── openenv.yaml # OpenEnv platform deployment config
489
  ├── pyproject.toml # Package metadata and dependency groups
490
  ├── run_agent.py # Single-episode interactive agent runner
 
491
  ├── server/
492
  │ ├── app.py # FastAPI/OpenEnv server entry point
493
  │ ├── Dockerfile # Multi-stage Docker build
@@ -512,6 +534,7 @@ That makes it suitable for:
512
  │ ├── rollout_collection.py # Direct rollout collection helper
513
  │ └── trajectory.py # Trajectory serialization and dataset utilities
514
  ├── training_script.py # TRL GRPO training entry point
 
515
  └── tests/
516
  ├── test_environment.py
517
  ├── test_literature_benchmark.py
 
249
 
250
  Dependencies are managed with `uv`. The package requires Python ≥ 3.10.
251
 
252
+ > **H100 Jupyter notebook setup:** See [H100_JUPYTER_SETUP.md](H100_JUPYTER_SETUP.md) for environment setup on NVIDIA H100 instances with Jupyter.
253
+
254
  ```bash
255
  # Core environment only
256
  uv sync
 
344
  uv run python run_agent.py
345
  ```
346
 
347
+ For H100 and other large-GPU workflows, prefer the quantized Unsloth path:
348
+
349
+ ```bash
350
+ uv sync --extra train
351
+ uv run python run_agent_unsloth.py
352
+ ```
353
+
354
  Configuration is via environment variables:
355
 
356
  | Variable | Default | Description |
 
376
  uv run python training_script.py --model-id Qwen/Qwen3.5-0.8B
377
  ```
378
 
379
+ For H100, the preferred entrypoint is `training_unsloth.py`, which uses Unsloth 4-bit loading plus LoRA for faster quantized GRPO training:
380
+
381
+ ```bash
382
+ uv sync --extra train
383
+ uv run python training_unsloth.py --dry-run
384
+ uv run python training_unsloth.py --model-id Qwen/Qwen3.5-4B
385
+ ```
386
+
387
  Key arguments:
388
 
389
  | Argument | Default | Description |
 
398
  | `--scenario-name` | all | Repeatable; restricts which scenarios are used |
399
  | `--domain-randomise` | off | Enable domain randomisation |
400
  | `--num-generations` | `4` | GRPO generations per prompt |
401
+ | `--max-completion-length` | `160` | Max tokens for model completions |
402
  | `--max-prompt-length` | `768` | Max tokens for prompts |
403
  | `--learning-rate` | `5e-6` | AdamW learning rate |
404
  | `--dry-run` | off | Build data and test reward without training |
405
 
406
  By default the reward function reconstructs prompt states locally so the prompt and reward stay aligned. Switch to a live server-backed reward loop with `--reward-backend remote --base-url http://localhost:8000`.
407
 
408
+ `training_unsloth.py` adds H100-oriented options such as `--max-seq-length`, `--disable-4bit`, `--disable-fast-inference`, and LoRA settings (`--lora-r`, `--lora-alpha`, `--lora-dropout`).
409
+
410
  After training, the script saves plots to the output directory:
411
 
412
  - `training_loss.png`
 
432
  - `training/literature_benchmark.py` runs paper-aligned action sequences and compares outcomes against curated expected findings
433
  - `training/rollout_collection.py` collects direct environment rollouts into trajectory files
434
  - `training_script.py` trains a GRPO policy with OpenEnv reward calls
435
+ - `training_unsloth.py` trains a quantized GRPO policy with Unsloth on H100-class GPUs
436
  - `run_agent.py` runs a local language model planner against the environment
437
+ - `run_agent_unsloth.py` runs the planner with Unsloth 4-bit loading for faster inference
438
  - `training/trajectory.py` stores trajectories for offline RL, imitation learning, replay, and evaluation
439
  - `training/evaluation.py` computes online, benchmark, expert-review, and fidelity-oriented metrics
440
 
 
509
  ├── openenv.yaml # OpenEnv platform deployment config
510
  ├── pyproject.toml # Package metadata and dependency groups
511
  ├── run_agent.py # Single-episode interactive agent runner
512
+ ├── run_agent_unsloth.py # Quantized Unsloth interactive agent runner
513
  ├── server/
514
  │ ├── app.py # FastAPI/OpenEnv server entry point
515
  │ ├── Dockerfile # Multi-stage Docker build
 
534
  │ ├── rollout_collection.py # Direct rollout collection helper
535
  │ └── trajectory.py # Trajectory serialization and dataset utilities
536
  ├── training_script.py # TRL GRPO training entry point
537
+ ├── training_unsloth.py # Unsloth quantized GRPO training entry point
538
  └── tests/
539
  ├── test_environment.py
540
  ├── test_literature_benchmark.py
_dashboard_state.json CHANGED
@@ -1,40 +1,40 @@
1
  {
2
- "timestamp": 1772961529.009918,
3
- "step": 4,
4
  "episode_done": false,
5
- "cumulative_reward": 1.8781281693998408,
6
- "gen_time_s": 15.02,
7
- "model_response_raw": "{\n\"action_type\": \"prepare_library\",\n\"method\": \"\",\n\"parameters\": {\n\"cutoff_value\": \"\"\n},\n\"Justification\": \"The error 'Cannot sequence without library preparation' indicates that while sample collection was completed successfully via dry-polymerization technology ('collectible'), mature cDNA libraries have yet to be constructed from those frozen tissues. Library prep requires reverse transcription followed by amplification/cleanup using polymerases/ligation enzymes. Since all other required conditions met but materials remain unprocessed, preparing high-throughput Illumina libraries first allows",
8
  "model_thinking": "",
9
  "thinking_enabled": false,
10
  "task": {
11
- "problem_statement": "Determine the effect of anti-TNF biologic therapy on cell states in synovium tissue affected by rheumatoid arthritis.",
12
  "organism": "human",
13
- "tissue": "synovium",
14
  "modality": "scRNA-seq",
15
  "conditions": [
16
- "untreated_rheumatoid_arthritis",
17
- "anti_TNF_treated"
18
  ],
19
- "budget_limit": 80340.57829173621,
20
- "time_limit_days": 140.67464277202689
21
  },
22
  "resources": {
23
- "budget_used": 13500.0,
24
- "budget_remaining": 66840.58,
25
- "time_used_days": 11.0,
26
- "time_remaining_days": 129.7,
27
  "samples_consumed": 0,
28
- "compute_hours_used": 0.0
29
  },
30
  "pipeline_history": [
31
  {
32
  "step_index": 1,
33
  "action_type": "collect_sample",
34
  "method": null,
35
- "output_summary": "Collected 6 samples (quality=0.54)",
36
  "success": true,
37
- "quality_score": 0.541,
38
  "resource_cost": 5000.0,
39
  "time_cost_days": 7.0
40
  },
@@ -62,116 +62,259 @@
62
  "step_index": 4,
63
  "action_type": "prepare_library",
64
  "method": null,
65
- "output_summary": "Library prepared (complexity=0.72)",
66
  "success": true,
67
- "quality_score": 0.718,
68
  "resource_cost": 8000.0,
69
  "time_cost_days": 3.0
 
70
  }
71
  ],
72
  "current_action": {
73
- "action_type": "prepare_library",
74
  "method": null,
75
  "parameters": {
76
- "cutoff_value": ""
77
  },
78
- "justification": "The error 'Cannot sequence without library preparation' indicates that while sample collection was completed successfully via dry-polymerization technology ('collectible'), mature cDNA libraries have yet to be constructed from those frozen tissues. Library prep requires reverse transcription followed by amplification/cleanup using polymerases/ligation enzymes. Since all other required conditions met but materials remain unprocessed, preparing high-throughput Illumina libraries first allows conversion of input DNA/RNA directly into sequencable formats.",
79
- "confidence": 0.95
80
  },
81
  "latest_output": {
82
- "summary": "Library prepared (complexity=0.72)",
83
- "success": true,
84
- "quality_score": 0.718,
85
  "uncertainty": 0.0,
86
  "warnings": [],
87
- "data_preview": "{\"library_complexity\": 0.7177962559040082, \"method\": \"10x_chromium\"}"
88
  },
89
  "discovered_markers": [],
90
  "candidate_mechanisms": [],
91
- "rule_violations": [],
 
  "uncertainty_summary": {
93
- "avg_uncertainty": 0.0,
94
- "avg_quality": 0.815
 
95
  },
96
- "reward_breakdown": {},
97
  "latent": {
98
  "cell_populations": [
99
  {
100
- "name": "fibroblast",
101
- "proportion": 0.445,
102
- "marker_genes": [
103
- "COL1A1",
104
- "FAP",
105
- "THY1"
106
- ],
107
- "state": "activated"
108
- },
109
- {
110
- "name": "CD4_T_cell",
111
- "proportion": 0.179,
112
  "marker_genes": [
113
- "CD3D",
114
- "CD4",
115
- "IL7R"
116
  ],
117
- "state": "quiescent"
118
  },
119
  {
120
- "name": "CD8_T_cell",
121
- "proportion": 0.139,
122
  "marker_genes": [
123
- "CD3D",
124
- "CD8A",
125
- "GZMB"
126
  ],
127
- "state": "activated"
128
  },
129
  {
130
- "name": "B_cell",
131
- "proportion": 0.142,
132
  "marker_genes": [
133
- "CD19",
134
- "MS4A1",
135
- "CD79A"
136
  ],
137
- "state": "quiescent"
138
  },
139
  {
140
  "name": "endothelial",
141
- "proportion": 0.096,
142
  "marker_genes": [
143
- "PECAM1",
144
- "VWF"
 
145
  ],
146
  "state": "quiescent"
147
  }
148
  ],
149
  "true_markers": [
150
- "TNF",
151
- "IL6",
152
- "MMP3",
153
- "CXCL13"
154
  ],
155
  "causal_mechanisms": [
156
- "TNF/NF-kB-driven synovial inflammation",
157
- "Th17-mediated cartilage destruction via MMPs"
 
158
  ],
159
  "true_pathways": {
160
- "JAK_STAT_signalling": 0.785,
161
- "TNF_signalling": 0.723,
162
- "Th17_differentiation": 0.633,
163
- "NF_kB_signalling": 0.826,
164
- "matrix_metalloproteinase_activity": 0.847
165
- },
166
- "true_de_genes_count": 9,
167
- "true_regulatory_network_size": 16,
168
  "confounders": {},
169
- "n_true_cells": 15873,
170
  "technical": {
171
- "ambient_rna_fraction": 0.037873267501661645,
172
- "doublet_rate": 0.03797665930677535,
173
- "dropout_rate": 0.14738025069803395,
174
- "sample_quality": 0.9068064354870293,
175
  "library_complexity": 0.8,
176
  "capture_efficiency": 0.6
177
  },
@@ -181,24 +324,24 @@
181
  "cells_cultured": false,
182
  "library_prepared": true,
183
  "perturbation_applied": false,
184
- "cells_sequenced": false,
185
- "qc_performed": false,
186
- "data_filtered": false,
187
- "data_normalized": false,
188
  "batches_integrated": false,
189
- "cells_clustered": false,
190
  "de_performed": false,
191
  "trajectories_inferred": false,
192
  "pathways_analyzed": false,
193
  "networks_inferred": false,
194
  "markers_discovered": false,
195
  "markers_validated": false,
196
- "followup_designed": false,
197
  "subagent_review_requested": false,
198
  "conclusion_reached": false,
199
- "n_cells_sequenced": null,
200
- "n_cells_after_filter": null,
201
- "n_clusters_found": null,
202
  "n_de_genes_found": null,
203
  "n_markers_found": null
204
  },
 
1
  {
2
+ "timestamp": 1772965877.7057643,
3
+ "step": 20,
4
  "episode_done": false,
5
+ "cumulative_reward": -1.0267290925637838,
6
+ "gen_time_s": 28.83,
7
+ "model_response_raw": "{\n\"action_type\": \"marker_selection\",\n\"method\": \"\",\n\"parameters\": {\"\"diff_genes\"\": [],\"gene_lists_file\":\"DE_results.h5ad\"},\n\"Justification\": \"The 'violation' error indicates we cannot proceed directly from previous failed attempts due to incomplete filtering/normalization logic implied by the blockage condition ('without prior analysis'). However, looking strictly at the provided completed vs failing log chain where `design` was followed by `[OK] design`, but then immediate failures occurred (`synthesis_blocked`) which were resolved upon finding/deepening insights? Wait, re-evaluating strict fl",
8
  "model_thinking": "",
9
  "thinking_enabled": false,
10
  "task": {
11
+ "problem_statement": "Validate candidate biomarker TREM2 for Alzheimer's disease in brain tissue using single-cell RNA sequencing.",
12
  "organism": "human",
13
+ "tissue": "brain",
14
  "modality": "scRNA-seq",
15
  "conditions": [
16
+ "healthy",
17
+ "Alzheimer"
18
  ],
19
+ "budget_limit": 115264.15272286453,
20
+ "time_limit_days": 111.8768783273548
21
  },
22
  "resources": {
23
+ "budget_used": 28900.0,
24
+ "budget_remaining": 86364.15,
25
+ "time_used_days": 18.0,
26
+ "time_remaining_days": 93.9,
27
  "samples_consumed": 0,
28
+ "compute_hours_used": 12.0
29
  },
30
  "pipeline_history": [
31
  {
32
  "step_index": 1,
33
  "action_type": "collect_sample",
34
  "method": null,
35
+ "output_summary": "Collected 6 samples (quality=0.57)",
36
  "success": true,
37
+ "quality_score": 0.575,
38
  "resource_cost": 5000.0,
39
  "time_cost_days": 7.0
40
  },
 
62
  "step_index": 4,
63
  "action_type": "prepare_library",
64
  "method": null,
65
+ "output_summary": "Library prepared (complexity=0.77)",
66
  "success": true,
67
+ "quality_score": 0.77,
68
  "resource_cost": 8000.0,
69
  "time_cost_days": 3.0
70
+ },
71
+ {
72
+ "step_index": 5,
73
+ "action_type": "sequence_cells",
74
+ "method": null,
75
+ "output_summary": "Sequenced 12178 cells, 9490 genes detected, median UMI=2966",
76
+ "success": true,
77
+ "quality_score": 0.457,
78
+ "resource_cost": 15000.0,
79
+ "time_cost_days": 5.0
80
+ },
81
+ {
82
+ "step_index": 6,
83
+ "action_type": "filter_data",
84
+ "method": null,
85
+ "output_summary": "Action blocked: Cannot filter data before QC",
86
+ "success": false,
87
+ "quality_score": 1.0,
88
+ "resource_cost": 50.0,
89
+ "time_cost_days": 0.2
90
+ },
91
+ {
92
+ "step_index": 7,
93
+ "action_type": "run_qc",
94
+ "method": null,
95
+ "output_summary": "QC metrics computed",
96
+ "success": true,
97
+ "quality_score": 0.842,
98
+ "resource_cost": 100.0,
99
+ "time_cost_days": 0.5
100
+ },
101
+ {
102
+ "step_index": 8,
103
+ "action_type": "filter_data",
104
+ "method": null,
105
+ "output_summary": "Filtered 12178 \u2192 9967 cells (82% retained)",
106
+ "success": true,
107
+ "quality_score": 0.818,
108
+ "resource_cost": 50.0,
109
+ "time_cost_days": 0.2
110
+ },
111
+ {
112
+ "step_index": 9,
113
+ "action_type": "normalize_data",
114
+ "method": null,
115
+ "output_summary": "Normalized with log_normalize",
116
+ "success": true,
117
+ "quality_score": 1.0,
118
+ "resource_cost": 50.0,
119
+ "time_cost_days": 0.2
120
+ },
121
+ {
122
+ "step_index": 10,
123
+ "action_type": "marker_selection",
124
+ "method": null,
125
+ "output_summary": "Action blocked: Cannot select markers without DE results",
126
+ "success": false,
127
+ "quality_score": 1.0,
128
+ "resource_cost": 100.0,
129
+ "time_cost_days": 0.5
130
+ },
131
+ {
132
+ "step_index": 11,
133
+ "action_type": "design_followup_experiment",
134
+ "method": null,
135
+ "output_summary": "Action blocked: Follow-up design without prior analysis is blocked; complete wet-lab and computational steps first",
136
+ "success": false,
137
+ "quality_score": 1.0,
138
+ "resource_cost": 100.0,
139
+ "time_cost_days": 0.5
140
+ },
141
+ {
142
+ "step_index": 12,
143
+ "action_type": "cluster_cells",
144
+ "method": null,
145
+ "output_summary": "Found 3 clusters",
146
+ "success": true,
147
+ "quality_score": 0.765,
148
+ "resource_cost": 100.0,
149
+ "time_cost_days": 0.5
150
+ },
151
+ {
152
+ "step_index": 13,
153
+ "action_type": "design_followup_experiment",
154
+ "method": null,
155
+ "output_summary": "Follow-up experiment design proposed (evidence_signals=1)",
156
+ "success": true,
157
+ "quality_score": 0.28,
158
+ "resource_cost": 100.0,
159
+ "time_cost_days": 0.5
160
+ },
161
+ {
162
+ "step_index": 14,
163
+ "action_type": "synthesize_conclusion",
164
+ "method": null,
165
+ "output_summary": "Action blocked: Cannot synthesise conclusion before discovering or validating markers; Cannot synthesise conclusion befo",
166
+ "success": false,
167
+ "quality_score": 1.0,
168
+ "resource_cost": 0.0,
169
+ "time_cost_days": 0.5
170
+ },
171
+ {
172
+ "step_index": 15,
173
+ "action_type": "marker_selection",
174
+ "method": null,
175
+ "output_summary": "Action blocked: Cannot select markers without DE results",
176
+ "success": false,
177
+ "quality_score": 1.0,
178
+ "resource_cost": 100.0,
179
+ "time_cost_days": 0.5
180
+ },
181
+ {
182
+ "step_index": 16,
183
+ "action_type": "synthesize_conclusion",
184
+ "method": null,
185
+ "output_summary": "Action blocked: Cannot synthesise conclusion before discovering or validating markers; Cannot synthesise conclusion befo",
186
+ "success": false,
187
+ "quality_score": 1.0,
188
+ "resource_cost": 0.0,
189
+ "time_cost_days": 0.5
190
  }
191
  ],
192
  "current_action": {
193
+ "action_type": "synthesize_conclusion",
194
  "method": null,
195
  "parameters": {
196
+ "claims": [
197
+ {
198
+ "top_markers": [],
199
+ "causal_mechanisms": [],
200
+ "predicted_pathways": {},
201
+ "confidence": 0.5,
202
+ "claim_type": "correlational",
203
+ "claim": "forced terminal conclusion"
204
+ }
205
+ ]
206
  },
207
+ "justification": "forced terminal conclusion",
208
+ "confidence": 0.5
209
  },
210
  "latest_output": {
211
+ "summary": "Action blocked: Cannot synthesise conclusion before discovering or validating markers; Cannot synthesise conclusion before inferring pathways or mechanisms",
212
+ "success": false,
213
+ "quality_score": 1.0,
214
  "uncertainty": 0.0,
215
  "warnings": [],
216
+ "data_preview": null
217
  },
218
  "discovered_markers": [],
219
  "candidate_mechanisms": [],
220
+ "rule_violations": [
221
+ "Cannot synthesise conclusion before discovering or validating markers",
222
+ "Cannot synthesise conclusion before inferring pathways or mechanisms"
223
+ ],
224
  "uncertainty_summary": {
225
+ "avg_uncertainty": 0.194,
226
+ "avg_quality": 0.809
227
+ },
228
+ "reward_breakdown": {
229
+ "validity": -1.0,
230
+ "ordering": 0.0,
231
+ "info_gain": 0.0,
232
+ "efficiency": 0.0,
233
+ "novelty": 0.0,
234
+ "penalty": -1.0,
235
+ "shaping": 0.0,
236
+ "terminal": 0.0,
237
+ "total": -2.0,
238
+ "hard_violations": 2.0,
239
+ "term_validity": 0.0,
240
+ "term_ordering": 0.0,
241
+ "term_info_gain": 0.0,
242
+ "term_efficiency": 0.0,
243
+ "term_novelty": 0.0,
244
+ "term_penalty": 0.0,
245
+ "term_shaping": 0.0,
246
+ "term_terminal": 0.0,
247
+ "term_total": 0.0
248
  },
 
249
  "latent": {
250
  "cell_populations": [
251
  {
252
+ "name": "excitatory_neuron",
253
+ "proportion": 0.425,
254
  "marker_genes": [
255
+ "SLC17A7",
256
+ "CAMK2A",
257
+ "NRGN"
258
  ],
259
+ "state": "stressed"
260
  },
261
  {
262
+ "name": "inhibitory_neuron",
263
+ "proportion": 0.346,
264
  "marker_genes": [
265
+ "GAD1",
266
+ "GAD2",
267
+ "SLC32A1"
268
  ],
269
+ "state": "normal"
270
  },
271
  {
272
+ "name": "OPC",
273
+ "proportion": 0.093,
274
  "marker_genes": [
275
+ "PDGFRA",
276
+ "CSPG4",
277
+ "OLIG2"
278
  ],
279
+ "state": "progenitor"
280
  },
281
  {
282
  "name": "endothelial",
283
+ "proportion": 0.137,
284
  "marker_genes": [
285
+ "CLDN5",
286
+ "FLT1",
287
+ "PECAM1"
288
  ],
289
  "state": "quiescent"
290
  }
291
  ],
292
  "true_markers": [
293
+ "TREM2",
294
+ "APOE",
295
+ "GFAP"
 
296
  ],
297
  "causal_mechanisms": [
298
+ "TREM2-mediated microglial activation in amyloid clearance",
299
+ "complement-driven synaptic pruning",
300
+ "reactive astrogliosis amplifying neuroinflammation"
301
  ],
302
  "true_pathways": {
303
+ "complement_cascade": 0.827,
304
+ "neuroinflammation": 0.666,
305
+ "amyloid_processing": 0.733,
306
+ "synaptic_signalling": 0.438,
307
+ "lipid_metabolism": 0.616
308
+ },
309
+ "true_de_genes_count": 10,
310
+ "true_regulatory_network_size": 0,
311
  "confounders": {},
312
+ "n_true_cells": 20321,
313
  "technical": {
314
+ "ambient_rna_fraction": 0.050723618495539344,
315
+ "doublet_rate": 0.0546771548836933,
316
+ "dropout_rate": 0.05122168063297322,
317
+ "sample_quality": 0.937985596833521,
318
  "library_complexity": 0.8,
319
  "capture_efficiency": 0.6
320
  },
 
324
  "cells_cultured": false,
325
  "library_prepared": true,
326
  "perturbation_applied": false,
327
+ "cells_sequenced": true,
328
+ "qc_performed": true,
329
+ "data_filtered": true,
330
+ "data_normalized": true,
331
  "batches_integrated": false,
332
+ "cells_clustered": true,
333
  "de_performed": false,
334
  "trajectories_inferred": false,
335
  "pathways_analyzed": false,
336
  "networks_inferred": false,
337
  "markers_discovered": false,
338
  "markers_validated": false,
339
+ "followup_designed": true,
340
  "subagent_review_requested": false,
341
  "conclusion_reached": false,
342
+ "n_cells_sequenced": 12178,
343
+ "n_cells_after_filter": 9967,
344
+ "n_clusters_found": "3",
345
  "n_de_genes_found": null,
346
  "n_markers_found": null
347
  },
dashboard.html CHANGED
@@ -304,6 +304,20 @@ function esc(s) { if (s == null) return '—'; const d = document.createElement(
304
  function pct(used, total) { if (!total) return 0; return Math.min(100, Math.max(0, (used / total) * 100)); }
305
  function gaugeColor(p) { return p < 50 ? 'var(--green)' : p < 80 ? 'var(--amber)' : 'var(--red)'; }
306
  function fmt(n) { if (n == null) return '0'; return Number(n).toLocaleString('en-US', { maximumFractionDigits: 0 }); }
 
307
  function gauge(label, value, pctVal, inv) {
308
  let bar = '';
309
  if (pctVal != null) { const c = inv ? gaugeColor(100-pctVal) : gaugeColor(pctVal); bar = `<div class="gauge-bar"><div class="gauge-bar-fill" style="width:${pctVal.toFixed(1)}%;background:${c}"></div></div>`; }
@@ -360,7 +374,10 @@ function showReport() {
360
  const conc = s.conclusions || [];
361
  const trueM = lat.true_markers || [];
362
  const trueMech = lat.causal_mechanisms || [];
363
- const agentM = s.discovered_markers || [];
 
364
  const markerHits = agentM.filter(m => trueM.some(t => t.toUpperCase() === m.toUpperCase()));
365
  const r = s.resources || {};
366
 
@@ -393,7 +410,7 @@ function showReport() {
393
  html += `<div class="report-section"><h3>Ground Truth Comparison</h3>
394
  <div class="comparison-row"><div class="comparison-col"><h4>Agent's Markers</h4><div class="tag-list">${comparedTags(agentM, trueM, 'green')}</div></div>
395
  <div class="comparison-col"><h4>True Markers</h4><div class="tag-list">${tagsHTML(trueM,'green')}</div></div></div>
396
- <div class="comparison-row"><div class="comparison-col"><h4>Agent's Mechanisms</h4><div class="tag-list">${comparedTags(s.candidate_mechanisms, trueMech, 'pink')}</div></div>
397
  <div class="comparison-col"><h4>True Mechanisms</h4><div class="tag-list">${tagsHTML(trueMech,'pink')}</div></div></div>
398
  </div>`;
399
 
@@ -473,12 +490,16 @@ function renderState(s) {
473
  // Ground truth comparison (visible when done or has conclusions)
474
  const lat = s.latent;
475
  if ((s.episode_done || conc.length) && lat) {
 
476
  $('card-gt-comparison').style.display = '';
477
- setHTML('gt-agent-markers', comparedTags(s.discovered_markers, lat.true_markers, 'green'));
478
  setHTML('gt-true-markers', tagsHTML(lat.true_markers, 'green'));
479
- setHTML('gt-agent-mechs', comparedTags(s.candidate_mechanisms, lat.causal_mechanisms, 'pink'));
480
  setHTML('gt-true-mechs', tagsHTML(lat.causal_mechanisms, 'pink'));
481
- const hits = (s.discovered_markers||[]).filter(m => (lat.true_markers||[]).some(t => t.toUpperCase()===m.toUpperCase()));
482
  $('gt-score').innerHTML = `Marker accuracy: <span style="color:var(--accent)">${hits.length}</span> / ${(lat.true_markers||[]).length} true markers recovered`;
483
  } else { $('card-gt-comparison').style.display = 'none'; }
484
 
 
304
  function pct(used, total) { if (!total) return 0; return Math.min(100, Math.max(0, (used / total) * 100)); }
305
  function gaugeColor(p) { return p < 50 ? 'var(--green)' : p < 80 ? 'var(--amber)' : 'var(--red)'; }
306
  function fmt(n) { if (n == null) return '0'; return Number(n).toLocaleString('en-US', { maximumFractionDigits: 0 }); }
307
+ function uniqueItems(arr) {
308
+ const out = [];
309
+ const seen = new Set();
310
+ (arr || []).forEach(item => {
311
+ if (item == null) return;
312
+ const text = String(item).trim();
313
+ if (!text) return;
314
+ const key = text.toUpperCase();
315
+ if (seen.has(key)) return;
316
+ seen.add(key);
317
+ out.push(text);
318
+ });
319
+ return out;
320
+ }
321
  function gauge(label, value, pctVal, inv) {
322
  let bar = '';
323
  if (pctVal != null) { const c = inv ? gaugeColor(100-pctVal) : gaugeColor(pctVal); bar = `<div class="gauge-bar"><div class="gauge-bar-fill" style="width:${pctVal.toFixed(1)}%;background:${c}"></div></div>`; }
 
374
  const conc = s.conclusions || [];
375
  const trueM = lat.true_markers || [];
376
  const trueMech = lat.causal_mechanisms || [];
377
+ const conclusionMarkers = uniqueItems(conc.flatMap(c => c.top_markers || []));
378
+ const conclusionMechanisms = uniqueItems(conc.flatMap(c => c.causal_mechanisms || []));
379
+ const agentM = uniqueItems((s.discovered_markers && s.discovered_markers.length) ? s.discovered_markers : conclusionMarkers);
380
+ const agentMechanisms = uniqueItems((s.candidate_mechanisms && s.candidate_mechanisms.length) ? s.candidate_mechanisms : conclusionMechanisms);
381
  const markerHits = agentM.filter(m => trueM.some(t => t.toUpperCase() === m.toUpperCase()));
382
  const r = s.resources || {};
383
 
 
410
  html += `<div class="report-section"><h3>Ground Truth Comparison</h3>
411
  <div class="comparison-row"><div class="comparison-col"><h4>Agent's Markers</h4><div class="tag-list">${comparedTags(agentM, trueM, 'green')}</div></div>
412
  <div class="comparison-col"><h4>True Markers</h4><div class="tag-list">${tagsHTML(trueM,'green')}</div></div></div>
413
+ <div class="comparison-row"><div class="comparison-col"><h4>Agent's Mechanisms</h4><div class="tag-list">${comparedTags(agentMechanisms, trueMech, 'pink')}</div></div>
414
  <div class="comparison-col"><h4>True Mechanisms</h4><div class="tag-list">${tagsHTML(trueMech,'pink')}</div></div></div>
415
  </div>`;
416
 
 
490
  // Ground truth comparison (visible when done or has conclusions)
491
  const lat = s.latent;
492
  if ((s.episode_done || conc.length) && lat) {
493
+ const conclusionMarkers = uniqueItems(conc.flatMap(c => c.top_markers || []));
494
+ const conclusionMechanisms = uniqueItems(conc.flatMap(c => c.causal_mechanisms || []));
495
+ const comparisonMarkers = uniqueItems((s.discovered_markers && s.discovered_markers.length) ? s.discovered_markers : conclusionMarkers);
496
+ const comparisonMechanisms = uniqueItems((s.candidate_mechanisms && s.candidate_mechanisms.length) ? s.candidate_mechanisms : conclusionMechanisms);
497
  $('card-gt-comparison').style.display = '';
498
+ setHTML('gt-agent-markers', comparedTags(comparisonMarkers, lat.true_markers, 'green'));
499
  setHTML('gt-true-markers', tagsHTML(lat.true_markers, 'green'));
500
+ setHTML('gt-agent-mechs', comparedTags(comparisonMechanisms, lat.causal_mechanisms, 'pink'));
501
  setHTML('gt-true-mechs', tagsHTML(lat.causal_mechanisms, 'pink'));
502
+ const hits = comparisonMarkers.filter(m => (lat.true_markers||[]).some(t => t.toUpperCase()===m.toUpperCase()));
503
  $('gt-score').innerHTML = `Marker accuracy: <span style="color:var(--accent)">${hits.length}</span> / ${(lat.true_markers||[]).length} true markers recovered`;
504
  } else { $('card-gt-comparison').style.display = 'none'; }
505
 
debug-904eee.log ADDED
@@ -0,0 +1,10 @@
+ {"sessionId": "904eee", "message": "repair_failed", "data": {"input_tail": "n studies. Sequencing ensures uniform coverage which aids robustness during subsequent filtering.\",\n\"Confidence\": 0.75\n}", "repaired_tail": "n studies. Sequencing ensures uniform coverage which aids robustness during subsequent filtering.\",\n\"Confidence\": 0.75\n}", "hypothesisId": "H1"}, "timestamp": 1772961980164}
+ {"sessionId": "904eee", "message": "extract_failed", "data": {"normalized_tail": "ssues common in multi-condition studies. Sequencing ensures uniform coverage which aids robustness during subsequent filtering.\",\n\"Confidence\": 0.75\n}", "repair_returned": false, "last_json_err": "Expecting ':' delimiter: line 1 column 4 (char 3)", "has_python_none": false, "hypothesisId": "H4"}, "timestamp": 1772961980165}
+ {"sessionId": "904eee", "message": "repair_failed", "data": {"input_tail": "ia unsupervised learning using dimension reduction techniques like PCA/UMAP from normalized data.\",\n\"Confidence\": 0.95\n}", "repaired_tail": "a unsupervised learning using dimension reduction techniques like PCA/UMAP from normalized data.\",\n\"Confidence\": 0.95\n}\"", "hypothesisId": "H1"}, "timestamp": 1772962029639}
+ {"sessionId": "904eee", "message": "extract_failed", "data": {"normalized_tail": "inct cell types/states first via unsupervised learning using dimension reduction techniques like PCA/UMAP from normalized data.\",\n\"Confidence\": 0.95\n}", "repair_returned": false, "last_json_err": "Expecting ':' delimiter: line 1 column 4 (char 3)", "has_python_none": false, "hypothesisId": "H4"}, "timestamp": 1772962029640}
+ {"sessionId": "904eee", "message": "repair_failed", "data": {"input_tail": "usters via downsampling/filtering based on read depth and genome coverage thresholds is mandatory.\",\n\"Confidence\": 0.9\n}", "repaired_tail": "usters via downsampling/filtering based on read depth and genome coverage thresholds is mandatory.\",\n\"Confidence\": 0.9\n}", "hypothesisId": "H1"}, "timestamp": 1772962075204}
+ {"sessionId": "904eee", "message": "extract_failed", "data": {"normalized_tail": "ighly degraded/contaminated clusters via downsampling/filtering based on read depth and genome coverage thresholds is mandatory.\",\n\"Confidence\": 0.9\n}", "repair_returned": false, "last_json_err": "Expecting value: line 2 column 36 (char 37)", "has_python_none": false, "hypothesisId": "H4"}, "timestamp": 1772962075205}
+ {"sessionId": "904eee", "message": "repair_failed", "data": {"input_tail": "factors rather than dynamic processes like maturation pathways required for trajectory inference.\",\n\"Confidence\": 0.95\n}", "repaired_tail": "factors rather than dynamic processes like maturation pathways required for trajectory inference.\",\n\"Confidence\": 0.95\n}", "hypothesisId": "H1"}, "timestamp": 1772962101951}
+ {"sessionId": "904eee", "message": "extract_failed", "data": {"normalized_tail": "identify static transcription factors rather than dynamic processes like maturation pathways required for trajectory inference.\",\n\"Confidence\": 0.95\n}", "repair_returned": false, "last_json_err": "Expecting ':' delimiter: line 1 column 4 (char 3)", "has_python_none": false, "hypothesisId": "H4"}, "timestamp": 1772962101951}
+ {"sessionId": "904eee", "message": "repair_failed", "data": {"input_tail": "s unique to intermediate progenitors versus terminally differentiated cells within those clusters.\",\n\"Confidence\": 0.9\n}", "repaired_tail": "s unique to intermediate progenitors versus terminally differentiated cells within those clusters.\",\n\"Confidence\": 0.9\n}", "hypothesisId": "H1"}, "timestamp": 1772962135915}
+ {"sessionId": "904eee", "message": "extract_failed", "data": {"normalized_tail": "ndance transcriptional features unique to intermediate progenitors versus terminally differentiated cells within those clusters.\",\n\"Confidence\": 0.9\n}", "repair_returned": false, "last_json_err": "Expecting ':' delimiter: line 1 column 4 (char 3)", "has_python_none": false, "hypothesisId": "H4"}, "timestamp": 1772962135916}
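The `repair_failed` / `extract_failed` entries above all record generation tails that stop mid-JSON-object, which `json.loads` rejects outright. A minimal sketch of the failure mode and a crude brace-balancing repair (my own illustration, far simpler than the repo's `_repair_truncated_json`):

```python
import json

def crude_repair(fragment):
    """Close an unterminated string and any unbalanced braces (illustrative only)."""
    if fragment.count('"') % 2:   # odd quote count -> a string literal was cut off
        fragment += '"'
    fragment += "}" * (fragment.count("{") - fragment.count("}"))
    return fragment

tail = '{"Confidence": 0.75'          # truncated mid-object, like the logged tails
try:
    json.loads(tail)
except json.JSONDecodeError as err:
    print("unparseable:", err.msg)
print(json.loads(crude_repair(tail)))  # {'Confidence': 0.75}
```

Real repairs are messier (escaped quotes, nested arrays), which is exactly why the log also records a `repaired_tail` and the `last_json_err` when the repair still fails to parse.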
inference.ipynb ADDED
@@ -0,0 +1,166 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# Self-Driving Lab Inference on H100 With Unsloth\n",
+     "\n",
+     "This notebook loads a quantized Unsloth model, builds the same self-driving lab observation prompt used during training, generates the next structured lab action, and steps the simulator in a short closed-loop rollout similar to `run_agent.py`, but with faster 4-bit inference on H100."
+    ],
+    "id": "a9d34036"
+   },
+   {
+    "cell_type": "code",
+    "metadata": {},
+    "source": [
+     "%pip install -q -U torch transformers unsloth"
+    ],
+    "execution_count": null,
+    "outputs": [],
+    "id": "20b36e01"
+   },
+   {
+    "cell_type": "code",
+    "metadata": {},
+    "source": [
+     "import json\n",
+     "\n",
+     "import torch\n",
+     "\n",
+     "from training_script import format_observation\n",
+     "from training_unsloth import generate_action_with_model, load_model_artifacts\n",
+     "from server.hackathon_environment import BioExperimentEnvironment\n",
+     "\n",
+     "print(\"CUDA available:\", torch.cuda.is_available())\n",
+     "if torch.cuda.is_available():\n",
+     "    print(\"GPU:\", torch.cuda.get_device_name(0))\n",
+     "    print(\"bf16 supported:\", torch.cuda.is_bf16_supported())"
+    ],
+    "execution_count": null,
+    "outputs": [],
+    "id": "bcf24a2e"
+   },
+   {
+    "cell_type": "code",
+    "metadata": {},
+    "source": [
+     "MODEL_PATH = \"artifacts/grpo-unsloth-output\"  # or a Hugging Face repo / base model id\n",
+     "SCENARIO_NAME = \"cardiac_disease_de\"\n",
+     "SEED = 42\n",
+     "\n",
+     "tokenizer, model = load_model_artifacts(\n",
+     "    MODEL_PATH,\n",
+     "    trust_remote_code=True,\n",
+     "    max_seq_length=2048,\n",
+     "    load_in_4bit=True,\n",
+     "    fast_inference=True,\n",
+     "    prepare_for_inference=True,\n",
+     ")\n",
+     "\n",
+     "env = BioExperimentEnvironment(scenario_name=SCENARIO_NAME, domain_randomise=False)\n",
+     "obs = env.reset(seed=SEED)\n",
+     "print(format_observation(obs)[:3000])"
+    ],
+    "execution_count": null,
+    "outputs": [],
+    "id": "c54f2cfd"
+   },
+   {
+    "cell_type": "code",
+    "metadata": {},
+    "source": [
+     "result = generate_action_with_model(\n",
+     "    model,\n",
+     "    tokenizer,\n",
+     "    obs,\n",
+     "    max_new_tokens=160,\n",
+     "    temperature=0.2,\n",
+     "    top_p=0.9,\n",
+     "    do_sample=True,\n",
+     ")\n",
+     "\n",
+     "print(\"Model response:\\n\")\n",
+     "print(result[\"response_text\"])\n",
+     "print(\"\\nParsed action:\\n\")\n",
+     "result[\"action\"].model_dump() if result[\"action\"] is not None else None"
+    ],
+    "execution_count": null,
+    "outputs": [],
+    "id": "f9b25208"
+   },
+   {
+    "cell_type": "code",
+    "metadata": {},
+    "source": [
+     "if result[\"action\"] is not None:\n",
+     "    next_obs = env.step(result[\"action\"])\n",
+     "    print(\"Reward:\", next_obs.reward)\n",
+     "    print(\"Done:\", next_obs.done)\n",
+     "    print(\"Violations:\", next_obs.rule_violations)\n",
+     "    print(\"Markers:\", next_obs.discovered_markers[:5])\n",
+     "    print(\"Mechanisms:\", next_obs.candidate_mechanisms[:5])\n",
+     "    if next_obs.latest_output is not None:\n",
+     "        print(\"Summary:\", next_obs.latest_output.summary)\n",
+     "        print(\"Latest data preview:\")\n",
+     "        print(json.dumps(next_obs.latest_output.data, indent=2)[:1200])\n",
+     "else:\n",
+     "    print(\"Model output did not parse into an ExperimentAction.\")"
+    ],
+    "execution_count": null,
+    "outputs": [],
+    "id": "c2408f52"
+   },
+   {
+    "cell_type": "code",
+    "metadata": {},
+    "source": [
+     "# Optional short closed-loop rollout.\n",
+     "obs = env.reset(seed=7)\n",
+     "trajectory = []\n",
+     "\n",
+     "for step_idx in range(8):\n",
+     "    result = generate_action_with_model(model, tokenizer, obs, max_new_tokens=160)\n",
+     "    action = result[\"action\"]\n",
+     "    record = {\n",
+     "        \"step\": step_idx + 1,\n",
+     "        \"response_text\": result[\"response_text\"],\n",
+     "        \"action\": action.model_dump() if action is not None else None,\n",
+     "    }\n",
+     "    trajectory.append(record)\n",
+     "    if action is None:\n",
+     "        break\n",
+     "\n",
+     "    next_obs = env.step(action)\n",
+     "    record.update({\n",
+     "        \"reward\": next_obs.reward,\n",
+     "        \"done\": next_obs.done,\n",
+     "        \"violations\": list(next_obs.rule_violations),\n",
+     "        \"latest_summary\": next_obs.latest_output.summary if next_obs.latest_output is not None else None,\n",
+     "        \"discovered_markers\": list(next_obs.discovered_markers[:5]),\n",
+     "        \"candidate_mechanisms\": list(next_obs.candidate_mechanisms[:5]),\n",
+     "    })\n",
+     "    obs = next_obs\n",
+     "    if obs.done:\n",
+     "        break\n",
+     "\n",
+     "trajectory"
+    ],
+    "execution_count": null,
+    "outputs": [],
+    "id": "8af34f32"
+   }
+  ],
+  "metadata": {
+   "kernelspec": {
+    "display_name": "Python 3",
+    "language": "python",
+    "name": "python3"
+   },
+   "language_info": {
+    "name": "python"
+   }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+ }
pyproject.toml CHANGED
@@ -39,6 +39,7 @@ train = [
  "torch>=2.10.0",
  "transformers>=5.3.0",
  "trl>=0.29.0",
+ "unsloth",
  ]
 
  [project.scripts]
run_agent.py CHANGED
@@ -16,6 +16,7 @@ from models import (
      ActionType,
      ExperimentAction,
      ExperimentObservation,
+     OutputType,
      build_agent_observation_context,
      build_agent_system_prompt,
  )
@@ -205,6 +206,13 @@ def _strip_js_comments(text: str) -> str:
 
  def extract_json_object(text: str) -> Optional[Dict[str, Any]]:
      stripped = _normalize_jsonish_text(text).strip()
+     if stripped.startswith('"') and stripped.endswith('"'):
+         try:
+             unwrapped = json.loads(stripped)
+         except json.JSONDecodeError:
+             unwrapped = None
+         if isinstance(unwrapped, str):
+             stripped = _normalize_jsonish_text(unwrapped).strip()
      fence_prefix = "```"
      if stripped.startswith(fence_prefix) and stripped.endswith(fence_prefix):
          lines = stripped.splitlines()
@@ -226,6 +234,7 @@ def extract_json_object(text: str) -> Optional[Dict[str, Any]]:
              break
          start = stripped.find("{", start + 1)
 
+     repaired = None
      first_brace = stripped.find("{")
      if first_brace != -1:
          repaired = _repair_truncated_json(stripped[first_brace:])
@@ -365,34 +374,34 @@ def parse_action(text: str) -> Optional[ExperimentAction]:
      if d is not None:
          action_type = normalize_action_type(get_payload_value(d, "action_type"))
          if action_type is None:
-             return None
-
-         parameters = get_payload_value(d, "parameters", "params") or {}
-         if not isinstance(parameters, dict):
-             parameters = {}
+             pass
+         else:
+             parameters = get_payload_value(d, "parameters", "params") or {}
+             if not isinstance(parameters, dict):
+                 parameters = {}
 
-         confidence = get_payload_value(d, "confidence")
-         if confidence is None:
-             confidence = 0.5
-         try:
-             confidence = float(confidence)
-         except (TypeError, ValueError):
-             confidence = 0.5
+             confidence = get_payload_value(d, "confidence")
+             if confidence is None:
+                 confidence = 0.5
+             try:
+                 confidence = float(confidence)
+             except (TypeError, ValueError):
+                 confidence = 0.5
 
-         justification = get_payload_value(
-             d, "justification", "reasoning", "rationale", "reason"
-         )
-         if justification is not None and not isinstance(justification, str):
-             justification = compact_preview(justification, 200)
-         method = normalize_optional_string(get_payload_value(d, "method"))
-
-         return ExperimentAction(
-             action_type=ActionType(action_type),
-             method=method,
-             parameters=parameters,
-             justification=justification,
-             confidence=min(1.0, max(0.0, confidence)),
-         )
+             justification = get_payload_value(
+                 d, "justification", "justifyement", "reasoning", "rationale", "reason"
+             )
+             if justification is not None and not isinstance(justification, str):
+                 justification = compact_preview(justification, 200)
+             method = normalize_optional_string(get_payload_value(d, "method"))
+
+             return ExperimentAction(
+                 action_type=ActionType(action_type),
+                 method=method,
+                 parameters=parameters,
+                 justification=justification,
+                 confidence=min(1.0, max(0.0, confidence)),
+             )
 
      action_match = re.search(
          r'["\']action_type["\']\s*:\s*["\']([^"\']+)',
@@ -472,6 +481,107 @@ def should_force_terminal_conclusion(
      )
 
 
+ def _unique_nonempty(items: List[str], limit: int = 5) -> List[str]:
+     seen: set[str] = set()
+     result: List[str] = []
+     for raw in items:
+         value = normalize_optional_string(raw)
+         if not value:
+             continue
+         key = value.upper()
+         if key in seen:
+             continue
+         seen.add(key)
+         result.append(value)
+         if len(result) >= limit:
+             break
+     return result
+
+
+ def _infer_conclusion_evidence(
+     obs: ExperimentObservation,
+ ) -> tuple[List[str], List[str], Dict[str, float]]:
+     top_markers = _unique_nonempty(list(obs.discovered_markers), limit=5)
+     causal_mechanisms = _unique_nonempty(list(obs.candidate_mechanisms), limit=5)
+     predicted_pathways: Dict[str, float] = {}
+
+     for output in reversed(obs.all_outputs):
+         if not output.success:
+             continue
+
+         data = output.data or {}
+         if not top_markers:
+             if output.output_type == OutputType.MARKER_RESULT:
+                 top_markers = _unique_nonempty(list(data.get("markers", [])), limit=5)
+             elif output.output_type == OutputType.DE_RESULT:
+                 top_markers = _unique_nonempty(
+                     [item.get("gene") for item in data.get("top_genes", []) if isinstance(item, dict)],
+                     limit=5,
+                 )
+
+         if output.output_type == OutputType.PATHWAY_RESULT and not predicted_pathways:
+             for item in data.get("top_pathways", []):
+                 if not isinstance(item, dict):
+                     continue
+                 pathway = normalize_optional_string(item.get("pathway"))
+                 score = item.get("score")
+                 if pathway and isinstance(score, (int, float)):
+                     predicted_pathways[pathway] = float(score)
+                 if len(predicted_pathways) >= 5:
+                     break
+
+         if not causal_mechanisms:
+             if output.output_type == OutputType.PATHWAY_RESULT:
+                 causal_mechanisms = _unique_nonempty(
+                     [item.get("pathway") for item in data.get("top_pathways", []) if isinstance(item, dict)],
+                     limit=5,
+                 )
+             elif output.output_type == OutputType.NETWORK_RESULT:
+                 causal_mechanisms = _unique_nonempty(
+                     list(data.get("top_regulators", [])),
+                     limit=5,
+                 )
+
+         if top_markers and causal_mechanisms and predicted_pathways:
+             break
+
+     return top_markers, causal_mechanisms, predicted_pathways
+
+
+ def ensure_conclusion_claims(
+     obs: ExperimentObservation,
+     action: ExperimentAction,
+ ) -> ExperimentAction:
+     if action.action_type != ActionType.SYNTHESIZE_CONCLUSION:
+         return action
+
+     parameters = dict(action.parameters or {})
+     raw_claims = parameters.get("claims")
+     if isinstance(raw_claims, list) and raw_claims:
+         normalized_claims = [claim for claim in raw_claims if isinstance(claim, dict)]
+         if normalized_claims:
+             parameters["claims"] = normalized_claims
+             if parameters != action.parameters:
+                 return action.model_copy(update={"parameters": parameters})
+             return action
+
+     top_markers, causal_mechanisms, predicted_pathways = _infer_conclusion_evidence(obs)
+     claim_type = "causal" if causal_mechanisms else "correlational"
+     conditions = " vs ".join(obs.task.conditions[:2]) if obs.task.conditions else "the task conditions"
+     claim = action.justification or f"Final synthesis for {conditions}."
+
+     parameters["claims"] = [{
+         "top_markers": top_markers,
+         "causal_mechanisms": causal_mechanisms,
+         "predicted_pathways": predicted_pathways,
+         "confidence": action.confidence,
+         "claim_type": claim_type,
+         "claim": claim,
+     }]
+     if not action.justification:
+         action = action.model_copy(update={"justification": claim})
+     return action.model_copy(update={"parameters": parameters})
+
 
  def write_dashboard_state(
      env: BioExperimentEnvironment,
@@ -888,6 +998,8 @@ def main():
              confidence=action.confidence,
          )
 
+         action = ensure_conclusion_claims(obs, action)
+
          log(f"\nStep {step + 1}: {action.action_type.value} ({gen_time:.1f}s)")
          if thinking:
              log(f"    Thinking: {thinking[:200]}")
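The new unwrap branch in `extract_json_object` handles double-encoded output: a JSON object serialized inside a JSON string, which a plain `json.loads`-on-braces strategy never finds. A standalone sketch of the idea (simplified; the function name is mine, not the repo's):

```python
import json

def unwrap_double_encoded(text):
    """Decode an outer JSON string layer, then parse the inner object.

    Mirrors (in simplified form) the unwrap branch: if the payload is a
    quoted JSON string, peel that layer before looking for the object.
    """
    stripped = text.strip()
    if stripped.startswith('"') and stripped.endswith('"'):
        try:
            inner = json.loads(stripped)  # decode the outer string layer
        except json.JSONDecodeError:
            inner = None
        if isinstance(inner, str):
            stripped = inner.strip()
    try:
        return json.loads(stripped)
    except json.JSONDecodeError:
        return None

payload = json.dumps(json.dumps({"action_type": "synthesize_conclusion"}))
print(unwrap_double_encoded(payload))  # {'action_type': 'synthesize_conclusion'}
```

Models produce this shape when they quote their entire answer, so peeling one string layer before brace-scanning recovers actions that would otherwise fail to parse.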
run_agent_unsloth.py ADDED
@@ -0,0 +1,294 @@
+ """Run the bio-experiment environment with a quantized Unsloth model."""
+
+ from __future__ import annotations
+
+ import json
+ import os
+ import time
+ from typing import Any, Dict, Optional
+
+ from models import ActionType, ExperimentAction
+ from server.hackathon_environment import BioExperimentEnvironment
+ from training_unsloth import (
+     DEFAULT_MAX_SEQ_LENGTH,
+     generate_action_with_model,
+     load_model_artifacts,
+ )
+ from training_script import DEFAULT_COMPLETION_TOKEN_BUDGET
+
+ import run_agent as base
+
+ MODEL_ID = os.getenv("RUN_AGENT_UNSLOTH_MODEL_ID", "unsloth/Qwen3.5-2B-GGUF")
+ MAX_EPISODE_STEPS = int(
+     os.getenv("RUN_AGENT_UNSLOTH_MAX_EPISODE_STEPS", str(base.MAX_EPISODE_STEPS))
+ )
+ MAX_NEW_TOKENS = int(
+     os.getenv(
+         "RUN_AGENT_UNSLOTH_MAX_NEW_TOKENS",
+         str(DEFAULT_COMPLETION_TOKEN_BUDGET),
+     )
+ )
+ MAX_SEQ_LENGTH = int(
+     os.getenv("RUN_AGENT_UNSLOTH_MAX_SEQ_LENGTH", str(DEFAULT_MAX_SEQ_LENGTH))
+ )
+ TRUST_REMOTE_CODE = (
+     os.getenv("RUN_AGENT_UNSLOTH_TRUST_REMOTE_CODE", "1").strip().lower()
+     not in {"0", "false", "off"}
+ )
+ LOAD_IN_4BIT = (
+     os.getenv("RUN_AGENT_UNSLOTH_LOAD_IN_4BIT", "1").strip().lower()
+     not in {"0", "false", "off"}
+ )
+ FAST_INFERENCE = (
+     os.getenv("RUN_AGENT_UNSLOTH_FAST_INFERENCE", "1").strip().lower()
+     not in {"0", "false", "off"}
+ )
+
+
+ def check_dashboard_command() -> Optional[Dict[str, Any]]:
+     try:
+         raw = base.DASHBOARD_CMD_PATH.read_text(encoding="utf-8")
+         base.DASHBOARD_CMD_PATH.unlink(missing_ok=True)
+         return json.loads(raw)
+     except (FileNotFoundError, json.JSONDecodeError):
+         return None
+
+
+ def run_episode(
+     model: Any,
+     tokenizer: Any,
+     *,
+     scenario_name: Optional[str] = None,
+     custom_ground_truth: Optional[Dict[str, Any]] = None,
+ ) -> None:
+     env = BioExperimentEnvironment(scenario_name=scenario_name)
+     obs = env.reset()
+
+     if custom_ground_truth and env._latent:
+         gt = custom_ground_truth
+         bio = env._latent.biology
+         if gt.get("true_markers"):
+             bio.true_markers = gt["true_markers"]
+         if gt.get("causal_mechanisms"):
+             bio.causal_mechanisms = gt["causal_mechanisms"]
+         if gt.get("true_pathways"):
+             bio.true_pathways = {
+                 key: float(value) for key, value in gt["true_pathways"].items()
+             }
+
+     base.log("\n" + "=" * 70)
+     base.log(f"TASK: {obs.task.problem_statement}")
+     base.log(f"Conditions: {obs.task.conditions}")
+     base.log(
+         f"Budget: ${obs.task.budget_limit:,.0f} | "
+         f"Time: {obs.task.time_limit_days:.0f} days"
+     )
+     base.log("Runtime: Unsloth quantized generation")
+     base.log("=" * 70)
+
+     cumulative_reward = 0.0
+     base.write_dashboard_state(env, obs, step=0, cumulative_reward=0.0)
+
+     for step in range(MAX_EPISODE_STEPS):
+         cmd = check_dashboard_command()
+         if cmd and cmd.get("action") == "restart":
+             base.log("\n[DASHBOARD] Restart requested - ending episode early.")
+             break
+
+         t0 = time.time()
+         result = generate_action_with_model(
+             model,
+             tokenizer,
+             obs,
+             max_new_tokens=MAX_NEW_TOKENS,
+             temperature=0.2,
+             top_p=0.9,
+             do_sample=True,
+         )
+         response = result["response_text"]
+         action = result["action"]
+         gen_time = time.time() - t0
+
+         is_last_step = step == MAX_EPISODE_STEPS - 1
+         if action is None:
+             if is_last_step:
+                 base.log("\n  [!] Parse failed on final step - forcing synthesize_conclusion.")
+                 action = ExperimentAction(
+                     action_type=ActionType.SYNTHESIZE_CONCLUSION,
+                     justification="forced terminal conclusion",
+                     confidence=0.5,
+                 )
+             else:
+                 base.log(
+                     f"\n  [!] Parse failed, skipping step. Raw: {response[:150]}"
+                 )
+                 continue
+
+         completed_types = {
+             record.action_type for record in obs.pipeline_history if record.success
+         }
+         failed_types = {
+             record.action_type for record in obs.pipeline_history if not record.success
+         }
+
+         if base.should_force_terminal_conclusion(action, completed_types):
+             base.log(
+                 f"\n  [!] repeated completed meta step {action.action_type.value} "
+                 f"- forcing synthesize_conclusion."
+             )
+             action = ExperimentAction(
+                 action_type=ActionType.SYNTHESIZE_CONCLUSION,
+                 justification="repeated completed meta step forced terminal conclusion",
+                 confidence=action.confidence,
+             )
+             completed_types = {
+                 record.action_type for record in obs.pipeline_history if record.success
+             }
+
+         skip_reason = None
+         if action.action_type in completed_types:
+             skip_reason = f"blocked repeat of completed step {action.action_type.value}"
+         elif action.action_type in failed_types:
+             if base.should_block_failed_reattempt(obs.pipeline_history, action.action_type):
+                 skip_reason = (
+                     f"blocked re-attempt of failed step {action.action_type.value}"
+                 )
+
+         if skip_reason:
+             if is_last_step:
+                 base.log(
+                     f"\n  [!] {skip_reason} on final step - forcing synthesize_conclusion."
+                 )
+                 action = ExperimentAction(
+                     action_type=ActionType.SYNTHESIZE_CONCLUSION,
+                     justification="forced terminal conclusion",
+                     confidence=0.5,
+                 )
+             else:
+                 base.log(f"\n  [!] {skip_reason}, skipping step.")
+                 continue
+
+         if is_last_step and action.action_type != ActionType.SYNTHESIZE_CONCLUSION:
+             base.log(
+                 f"\n  [!] Final step - overriding {action.action_type.value} "
+                 "with synthesize_conclusion."
+             )
+             action = ExperimentAction(
+                 action_type=ActionType.SYNTHESIZE_CONCLUSION,
+                 justification="forced terminal conclusion",
+                 confidence=action.confidence,
+             )
+
+         action = base.ensure_conclusion_claims(obs, action)
+
+         base.log(f"\nStep {step + 1}: {action.action_type.value} ({gen_time:.1f}s)")
+         if action.justification:
+             base.log(f"  Rationale: {action.justification}")
+         else:
+             base.log("  Rationale: [model did not provide one]")
+         if action.parameters:
+             base.log(f"  Parameters: {base.compact_preview(action.parameters, 200)}")
+         elif response:
+             base.log(
+                 "  Model response: "
+                 f"{base.compact_preview(response, base.MODEL_RESPONSE_PREVIEW_CHARS)}"
+             )
+
+         obs = env.step(action)
+
+         if obs.latest_output:
+             latest_output = obs.latest_output
+             status = "OK" if latest_output.success else "FAIL"
+             base.log(f"  [{status}] {latest_output.summary}")
+             if latest_output.warnings:
+                 base.log(f"  Warnings: {latest_output.warnings}")
+
+         step_reward = obs.reward
+         cumulative_reward += step_reward
+         base.log(f"  Reward: {step_reward:+.3f} (cum: {cumulative_reward:+.3f})")
+         base.log(
+             f"  Budget: ${obs.resource_usage.budget_remaining:,.0f} | "
+             f"Time: {obs.resource_usage.time_remaining_days:.0f}d"
+         )
+
+         base.write_dashboard_state(
+             env,
+             obs,
+             step=step + 1,
+             cumulative_reward=cumulative_reward,
+             model_response=response,
+             action=action,
+             gen_time=gen_time,
+             episode_done=obs.done,
+         )
+
+         if obs.rule_violations:
+             base.log(f"  Violations: {obs.rule_violations}")
+         if obs.done:
+             break
+
+     base.log(f"\n{'=' * 70}")
+     base.log("EPISODE COMPLETE" if obs.done else f"MAX STEPS ({MAX_EPISODE_STEPS})")
+     base.log(f"  Steps: {obs.step_index}")
+     base.log(f"  Total reward: {cumulative_reward:+.3f}")
+     base.log(f"  Budget used: ${obs.resource_usage.budget_used:,.0f}")
+     base.log(f"  Time used: {obs.resource_usage.time_used_days:.0f} days")
+     if obs.conclusions:
+         base.log("  Conclusions:")
+         for conclusion in obs.conclusions:
+             base.log(
+                 f"    [{conclusion.claim_type}, conf={conclusion.confidence:.2f}] "
+                 f"{conclusion.claim}"
+             )
+             if conclusion.top_markers:
+                 base.log(f"      Markers: {conclusion.top_markers}")
+             if conclusion.causal_mechanisms:
+                 base.log(f"      Mechanisms: {conclusion.causal_mechanisms}")
+             if conclusion.predicted_pathways:
+                 base.log(f"      Pathways: {conclusion.predicted_pathways}")
+     base.log("=" * 70)
+
+
+ def main() -> None:
+     runtime = base.resolve_torch_runtime()
+     base.log(
+         f"Using Unsloth runtime: device={runtime['device']} "
+         f"name={runtime['device_name']} dtype={runtime['dtype']}"
+     )
+     tokenizer, model = load_model_artifacts(
+         MODEL_ID,
+         trust_remote_code=TRUST_REMOTE_CODE,
+         max_seq_length=MAX_SEQ_LENGTH,
+         load_in_4bit=LOAD_IN_4BIT,
+         fast_inference=FAST_INFERENCE,
+         prepare_for_inference=True,
+     )
+     base.DASHBOARD_CMD_PATH.unlink(missing_ok=True)
+     run_episode(model, tokenizer)
+
+     while True:
+         base.log("\nWaiting for dashboard command (restart / new task) ...")
+         while True:
+             cmd = check_dashboard_command()
+             if cmd:
+                 break
+             time.sleep(1.0)
+
+         action_type = cmd.get("action", "restart")
+         if action_type == "quit":
+             base.log("Quit requested.")
+             break
+
+         scenario = cmd.get("scenario_name")
+         ground_truth = cmd.get("ground_truth")
+         base.log(f"\n[DASHBOARD] {action_type} - scenario={scenario}")
+         run_episode(
+             model,
+             tokenizer,
+             scenario_name=scenario,
+             custom_ground_truth=ground_truth,
+         )
+
+
+ if __name__ == "__main__":
+     main()
server/hackathon_environment.py CHANGED
@@ -132,7 +132,7 @@ class BioExperimentEnvironment(Environment):
         self._outputs.append(result.output)
         self._update_discoveries(action, result.output)
 
-        if action.action_type == ActionType.SYNTHESIZE_CONCLUSION:
+        if action.action_type == ActionType.SYNTHESIZE_CONCLUSION and result.output.success:
             raw_claims = action.parameters.get("claims", [])
             for c in raw_claims:
                 if isinstance(c, dict):
@@ -218,7 +218,7 @@ class BioExperimentEnvironment(Environment):
             subagent_outputs=list(self._subagent_outputs),
             conclusions=list(self._conclusions),
             rule_violations=rule_violations or [],
-            step_reward_breakdown={},
+            step_reward_breakdown=reward_breakdown or {},
             done=done,
             reward=reward,
             metadata=meta,
server/rewards/reward.py CHANGED
@@ -24,7 +24,7 @@ The terminal reward adds:
 from __future__ import annotations
 
 from dataclasses import dataclass, field
-from typing import Any, Dict, List, Optional
+from typing import Dict, List, Optional
 
 from models import (
     ActionType,
@@ -214,10 +214,19 @@ class RewardComputer:
             discovered_markers,
             candidate_mechanisms,
         )
-        discovery_error_penalty = -2.5 * (1.0 - discovery_alignment)
+        discovery_error_penalty = -6.0 * (1.0 - discovery_alignment)
+        if discovery_alignment < 0.25:
+            discovery_error_penalty -= 2.0
         rb.components["discovery_alignment"] = discovery_alignment
         rb.components["discovery_error_penalty"] = discovery_error_penalty
 
+        conclusion_alignment = self._conclusion_alignment(state, conclusions)
+        conclusion_error_penalty = -4.0 * (1.0 - conclusion_alignment)
+        if conclusions and conclusion_alignment < 0.25:
+            conclusion_error_penalty -= 1.5
+        rb.components["conclusion_alignment"] = conclusion_alignment
+        rb.components["conclusion_error_penalty"] = conclusion_error_penalty
+
         eff_bonus = (budget_eff + time_eff) / 2.0 if completeness >= 0.3 else 0.0
         rb.terminal = (
             3.0 * completeness
@@ -225,6 +234,7 @@ class RewardComputer:
             + 1.0 * eff_bonus
             + overconf
             + discovery_error_penalty
+            + conclusion_error_penalty
         )
         return rb
 
@@ -470,3 +480,42 @@ class RewardComputer:
         if not components:
             return 1.0
         return sum(components) / len(components)
+
+    def _conclusion_alignment(
+        self,
+        s: FullLatentState,
+        conclusions: List[ConclusionClaim],
+    ) -> float:
+        if not conclusions:
+            return 0.0
+
+        pred_markers = [marker for conclusion in conclusions for marker in conclusion.top_markers]
+        pred_mechanisms = [
+            mechanism
+            for conclusion in conclusions
+            for mechanism in conclusion.causal_mechanisms
+        ]
+
+        if not pred_markers and not pred_mechanisms:
+            return self._legacy_calibration(s, conclusions)
+
+        components: List[float] = []
+        if s.biology.true_markers or pred_markers:
+            marker_recall = marker_set_score(pred_markers, s.biology.true_markers)
+            marker_precision = marker_set_score(s.biology.true_markers, pred_markers)
+            components.append((marker_recall + marker_precision) / 2.0)
+
+        if s.biology.causal_mechanisms or pred_mechanisms:
+            mechanism_recall = mechanism_set_score(
+                pred_mechanisms,
+                s.biology.causal_mechanisms,
+            )
+            mechanism_precision = mechanism_set_score(
+                s.biology.causal_mechanisms,
+                pred_mechanisms,
+            )
+            components.append((mechanism_recall + mechanism_precision) / 2.0)
+
+        if not components:
+            return 1.0
+        return sum(components) / len(components)
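The `_conclusion_alignment` helper added above averages a symmetric precision/recall score per evidence type (markers, mechanisms). A minimal standalone sketch of that averaging, with a hypothetical exact-match `set_score` standing in for the repository's `marker_set_score`/`mechanism_set_score` (which may use fuzzier matching):

```python
from typing import List


def set_score(predicted: List[str], truth: List[str]) -> float:
    # Hypothetical stand-in: fraction of `predicted` entries present in `truth`.
    if not predicted:
        return 0.0
    return sum(1 for p in predicted if p in truth) / len(predicted)


def alignment(pred_markers: List[str], true_markers: List[str],
              pred_mechs: List[str], true_mechs: List[str]) -> float:
    components: List[float] = []
    if true_markers or pred_markers:
        # Symmetric: score predictions against truth and truth against predictions.
        components.append((set_score(pred_markers, true_markers)
                           + set_score(true_markers, pred_markers)) / 2.0)
    if true_mechs or pred_mechs:
        components.append((set_score(pred_mechs, true_mechs)
                           + set_score(true_mechs, pred_mechs)) / 2.0)
    return sum(components) / len(components) if components else 1.0


print(alignment(["NPPA", "NPPB"], ["NPPA", "NPPB"], [], []))  # -> 1.0
print(alignment(["WRONG1"], ["NPPA", "NPPB"], [], []))        # -> 0.0
```

The symmetric average is what makes both missing true markers (low recall) and spurious predicted markers (low precision) pull the alignment, and therefore the terminal penalty, in the same direction.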
server/rules/engine.py CHANGED
@@ -45,6 +45,16 @@ class RuleEngine:
             p.markers_validated,
         ])
 
+    @staticmethod
+    def _has_marker_evidence(s: FullLatentState) -> bool:
+        p = s.progress
+        return p.markers_discovered or p.markers_validated
+
+    @staticmethod
+    def _has_mechanism_evidence(s: FullLatentState) -> bool:
+        p = s.progress
+        return p.pathways_analyzed or p.networks_inferred
+
     def check(
         self, action: ExperimentAction, state: FullLatentState
     ) -> List[RuleViolation]:
@@ -238,6 +248,20 @@ class RuleEngine:
                     message="Cannot synthesise conclusion without substantive analysis",
                 ))
 
+            if not self._has_marker_evidence(s):
+                vs.append(RuleViolation(
+                    rule_id="conclusion_without_marker_evidence",
+                    severity=Severity.HARD,
+                    message="Cannot synthesise conclusion before discovering or validating markers",
+                ))
+
+            if not self._has_mechanism_evidence(s):
+                vs.append(RuleViolation(
+                    rule_id="conclusion_without_mechanism_evidence",
+                    severity=Severity.HARD,
+                    message="Cannot synthesise conclusion before inferring pathways or mechanisms",
+                ))
+
             claims = action.parameters.get("claims", [])
             for claim in claims:
                 if isinstance(claim, dict) and claim.get("claim_type") == "causal":
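The two new hard rules above gate conclusions on both marker and mechanism evidence. A minimal sketch of that gating logic, using a hypothetical simplified `Progress` record in place of the environment's `FullLatentState.progress`:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Progress:
    # Hypothetical simplified stand-in for the environment's progress state.
    markers_discovered: bool = False
    markers_validated: bool = False
    pathways_analyzed: bool = False
    networks_inferred: bool = False


def conclusion_blockers(p: Progress) -> List[str]:
    # Mirrors the two new hard rules: a conclusion needs marker evidence
    # AND mechanism evidence before it is allowed to proceed.
    blockers: List[str] = []
    if not (p.markers_discovered or p.markers_validated):
        blockers.append("conclusion_without_marker_evidence")
    if not (p.pathways_analyzed or p.networks_inferred):
        blockers.append("conclusion_without_mechanism_evidence")
    return blockers


print(conclusion_blockers(Progress()))  # both gates fire
print(conclusion_blockers(Progress(markers_discovered=True, pathways_analyzed=True)))  # -> []
```

Either discovery *or* validation satisfies the marker gate, and either pathway analysis *or* network inference satisfies the mechanism gate, so an agent has two routes past each rule.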
tests/test_environment.py CHANGED
@@ -64,7 +64,7 @@ class TestEnvironmentLifecycle:
             parameters={"assay": "qPCR"},
         ))
         assert obs.latest_output is not None
-        assert obs.latest_output.success is True
+        assert obs.latest_output.success is False
         assert any("follow-up design" in msg.lower() for msg in obs.rule_violations)
 
     def test_conclusion_ends_episode(self):
@@ -81,6 +81,8 @@ class TestEnvironmentLifecycle:
             ExperimentAction(action_type=ActionType.CLUSTER_CELLS),
             ExperimentAction(action_type=ActionType.DIFFERENTIAL_EXPRESSION,
                              parameters={"comparison": "disease_vs_healthy"}),
+            ExperimentAction(action_type=ActionType.PATHWAY_ENRICHMENT),
+            ExperimentAction(action_type=ActionType.MARKER_SELECTION),
             ExperimentAction(
                 action_type=ActionType.SYNTHESIZE_CONCLUSION,
                 parameters={"claims": [
@@ -94,3 +96,33 @@ class TestEnvironmentLifecycle:
 
         assert obs.done is True
         assert obs.reward != 0.0
+
+    def test_blocked_conclusion_does_not_persist_claims(self):
+        env = BioExperimentEnvironment()
+        env.reset()
+
+        pipeline = [
+            ExperimentAction(action_type=ActionType.COLLECT_SAMPLE),
+            ExperimentAction(action_type=ActionType.PREPARE_LIBRARY),
+            ExperimentAction(action_type=ActionType.SEQUENCE_CELLS),
+            ExperimentAction(action_type=ActionType.RUN_QC),
+            ExperimentAction(action_type=ActionType.FILTER_DATA),
+            ExperimentAction(action_type=ActionType.NORMALIZE_DATA),
+            ExperimentAction(action_type=ActionType.CLUSTER_CELLS),
+        ]
+        for action in pipeline:
+            obs = env.step(action)
+            assert obs.latest_output is not None
+            assert obs.latest_output.success is True
+
+        obs = env.step(ExperimentAction(
+            action_type=ActionType.SYNTHESIZE_CONCLUSION,
+            parameters={"claims": [
+                {"claim": "Premature conclusion", "confidence": 0.9},
+            ]},
+        ))
+
+        assert obs.latest_output is not None
+        assert obs.latest_output.success is False
+        assert obs.conclusions == []
+        assert any("markers" in msg.lower() for msg in obs.rule_violations)
tests/test_rewards.py CHANGED
@@ -108,7 +108,13 @@ class TestTerminalReward:
                 claim_type="causal",
             ),
         ]
-        rb = rc.terminal_reward(state, claims, [])
+        rb = rc.terminal_reward(
+            state,
+            claims,
+            [],
+            discovered_markers=["NPPA"],
+            candidate_mechanisms=["TGF-beta-driven fibrosis"],
+        )
         assert rb.terminal > 0
 
     def test_overconfident_wrong_claim_penalised(self):
@@ -165,3 +171,46 @@ class TestTerminalReward:
         assert aligned.components["discovery_alignment"] > misaligned.components["discovery_alignment"]
         assert aligned.components["discovery_error_penalty"] > misaligned.components["discovery_error_penalty"]
         assert aligned.terminal > misaligned.terminal
+
+    def test_conclusion_error_penalizes_wrong_structured_claims(self):
+        rc = RewardComputer()
+        state = FullLatentState(
+            biology=LatentBiologicalState(
+                true_markers=["NPPA", "NPPB"],
+                causal_mechanisms=["TGF-beta-driven fibrosis"],
+            ),
+            progress=ExperimentProgress(
+                data_normalized=True,
+                de_performed=True,
+                markers_discovered=True,
+                pathways_analyzed=True,
+                conclusion_reached=True,
+            ),
+            resources=ResourceState(budget_total=100_000, budget_used=40_000),
+        )
+        aligned = rc.terminal_reward(
+            state,
+            [
+                ConclusionClaim(
+                    top_markers=["NPPA", "NPPB"],
+                    causal_mechanisms=["TGF-beta-driven fibrosis"],
+                    confidence=0.8,
+                ),
+            ],
+            [],
+        )
+        misaligned = rc.terminal_reward(
+            state,
+            [
+                ConclusionClaim(
+                    top_markers=["WRONG1"],
+                    causal_mechanisms=["unrelated process"],
+                    confidence=0.8,
+                ),
+            ],
+            [],
+        )
+
+        assert aligned.components["conclusion_alignment"] > misaligned.components["conclusion_alignment"]
+        assert aligned.components["conclusion_error_penalty"] > misaligned.components["conclusion_error_penalty"]
+        assert aligned.terminal > misaligned.terminal
tests/test_rules.py CHANGED
@@ -55,47 +55,68 @@ class TestPrerequisites:
 
 
 class TestRedundancy:
-    def test_double_qc_is_soft(self):
+    def test_double_qc_is_hard_blocked(self):
         engine = RuleEngine()
         violations = engine.check(
             ExperimentAction(action_type=ActionType.RUN_QC),
             _state(cells_sequenced=True, qc_performed=True),
         )
         hard = engine.hard_violations(violations)
-        soft = engine.soft_violations(violations)
-        assert not hard
-        assert any("redundant" in m.lower() for m in soft)
+        assert any("redundant" in m.lower() for m in hard)
 
-    def test_repeated_followup_design_is_soft(self):
+    def test_repeated_followup_design_is_hard_blocked(self):
         engine = RuleEngine()
         violations = engine.check(
             ExperimentAction(action_type=ActionType.DESIGN_FOLLOWUP),
             _state(followup_designed=True, de_performed=True),
         )
         hard = engine.hard_violations(violations)
-        soft = engine.soft_violations(violations)
-        assert not hard
-        assert any("redundant" in m.lower() for m in soft)
+        assert any("redundant" in m.lower() for m in hard)
 
 
 class TestMetaActionTiming:
-    def test_followup_design_without_analysis_is_soft(self):
+    def test_followup_design_without_analysis_is_hard_blocked(self):
         engine = RuleEngine()
         violations = engine.check(
             ExperimentAction(action_type=ActionType.DESIGN_FOLLOWUP),
             _state(),
         )
-        soft = engine.soft_violations(violations)
-        assert any("follow-up design" in m.lower() for m in soft)
+        hard = engine.hard_violations(violations)
+        assert any("follow-up design" in m.lower() for m in hard)
 
-    def test_subagent_review_without_analysis_is_soft(self):
+    def test_subagent_review_without_analysis_is_hard_blocked(self):
         engine = RuleEngine()
         violations = engine.check(
             ExperimentAction(action_type=ActionType.REQUEST_SUBAGENT_REVIEW),
             _state(),
         )
-        soft = engine.soft_violations(violations)
-        assert any("subagent review" in m.lower() for m in soft)
+        hard = engine.hard_violations(violations)
+        assert any("subagent review" in m.lower() for m in hard)
+
+    def test_conclusion_without_marker_or_mechanism_evidence_is_hard_blocked(self):
+        engine = RuleEngine()
+        violations = engine.check(
+            ExperimentAction(action_type=ActionType.SYNTHESIZE_CONCLUSION),
+            _state(data_normalized=True, cells_clustered=True),
+        )
+        hard = engine.hard_violations(violations)
+        assert any("markers" in m.lower() for m in hard)
+        assert any("pathways or mechanisms" in m.lower() for m in hard)
+
+    def test_conclusion_with_marker_and_mechanism_evidence_is_allowed(self):
+        engine = RuleEngine()
+        violations = engine.check(
+            ExperimentAction(action_type=ActionType.SYNTHESIZE_CONCLUSION),
+            _state(
+                data_normalized=True,
+                cells_clustered=True,
+                markers_discovered=True,
+                pathways_analyzed=True,
+            ),
+        )
+        hard = engine.hard_violations(violations)
+        assert not hard
+
 
 class TestResourceConstraints:
     def test_exhausted_budget_blocked(self):
tests/test_run_agent.py CHANGED
@@ -1,7 +1,7 @@
 """Tests for run_agent parser and fallback helpers."""
 
 from models import ActionType, ExperimentAction
-from run_agent import fallback_action, parse_action
+from run_agent import ensure_conclusion_claims, extract_json_object, parse_action, should_block_failed_reattempt
 from server.hackathon_environment import BioExperimentEnvironment
 
 
@@ -23,7 +23,23 @@ def test_parse_action_accepts_justifyement_typo():
     assert action.justification == "typo key"
 
 
-def test_fallback_uses_observation_progress_not_step_index():
+def test_extract_json_object_unwraps_quoted_json_string():
+    parsed = extract_json_object(
+        '"{\\"action_type\\": \\"run_qc\\", \\"method\\": \\"\\", \\"parameters\\": {}, \\"Justification\\": \\"check quality\\", \\"confidence\\": 0.8}"'
+    )
+    assert parsed is not None
+    assert parsed["action_type"] == "run_qc"
+
+
+def test_parse_action_falls_back_when_inner_object_lacks_action_type():
+    action = parse_action(
+        '"{\\"action_type\\": \\"design_followup_experiment\\", \\"method\\": \\"\\", \\"parameters\\": {\\"criterion_description\\": \\"\\"}, \\"Justification\\": \\"follow-up\\", \\"confidence\\": 0.6, \\"threshold_value\\": {\\"conditions\\": [], \\"gene_filter_criteria\\": \\"x\\", \\"sample_group_size\\": 3}}"'  # noqa: E501
+    )
+    assert action is not None
+    assert action.action_type == ActionType.DESIGN_FOLLOWUP
+
+
+def test_should_block_failed_reattempt_until_pipeline_progress():
     env = BioExperimentEnvironment(scenario_name="cardiac_disease_de", domain_randomise=False)
     obs = env.reset(seed=0)
     for action_type in (
@@ -32,5 +48,44 @@ def test_should_block_failed_reattempt_until_pipeline_progress():
         ActionType.SEQUENCE_CELLS,
     ):
         obs = env.step(ExperimentAction(action_type=action_type))
-    action = fallback_action(obs)
-    assert action.action_type == ActionType.RUN_QC
+    assert should_block_failed_reattempt(obs.pipeline_history, ActionType.SEQUENCE_CELLS) is False
+    assert should_block_failed_reattempt(obs.pipeline_history, ActionType.RUN_QC) is False
+
+
+def test_ensure_conclusion_claims_infers_from_outputs_when_discoveries_empty():
+    env = BioExperimentEnvironment(scenario_name="cardiac_disease_de", domain_randomise=False)
+    obs = env.reset(seed=0)
+    pipeline = [
+        ExperimentAction(action_type=ActionType.COLLECT_SAMPLE),
+        ExperimentAction(action_type=ActionType.PREPARE_LIBRARY),
+        ExperimentAction(action_type=ActionType.SEQUENCE_CELLS),
+        ExperimentAction(action_type=ActionType.RUN_QC),
+        ExperimentAction(action_type=ActionType.FILTER_DATA),
+        ExperimentAction(action_type=ActionType.NORMALIZE_DATA),
+        ExperimentAction(action_type=ActionType.CLUSTER_CELLS),
+        ExperimentAction(
+            action_type=ActionType.DIFFERENTIAL_EXPRESSION,
+            parameters={"comparison": "disease_vs_healthy"},
+        ),
+        ExperimentAction(action_type=ActionType.PATHWAY_ENRICHMENT),
+    ]
+    for action in pipeline:
+        obs = env.step(action)
+
+    sparse_obs = obs.model_copy(update={
+        "discovered_markers": [],
+        "candidate_mechanisms": [],
+    })
+    action = ensure_conclusion_claims(
+        sparse_obs,
+        ExperimentAction(
+            action_type=ActionType.SYNTHESIZE_CONCLUSION,
+            confidence=0.9,
+            parameters={},
+        ),
+    )
+
+    claims = action.parameters["claims"]
+    assert claims[0]["top_markers"]
+    assert claims[0]["causal_mechanisms"]
+    assert claims[0]["predicted_pathways"]
tests/test_training_script.py CHANGED
@@ -43,6 +43,15 @@ def test_parse_action_completion_accepts_reasoning_alias():
     assert action.justification == "Measure quality before filtering."
 
 
+def test_parse_action_completion_normalizes_run_agent_aliases():
+    action = parse_action_completion(
+        '{"action_type":"network_inference","method":"pySCENIC"}'
+    )
+    assert action is not None
+    assert action.action_type == ActionType.REGULATORY_NETWORK_INFERENCE
+    assert action.method == "pySCENIC"
+
+
 def test_build_prompt_examples_contains_reference_action():
     examples = build_prompt_examples(
         dataset_episodes=1,
@@ -55,6 +64,7 @@ def test_build_prompt_examples_contains_reference_action():
     assert len(examples) == 2
     assert examples[0]["scenario_name"] == "cardiac_disease_de"
     assert '"action_type": "collect_sample"' in examples[0]["reference_action"]
+    assert '"action_type": "select_cohort"' in examples[1]["reference_action"]
 
 
 def test_openenv_reward_penalizes_invalid_completion():
train.ipynb ADDED
@@ -0,0 +1,141 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "cbde861c",
+   "metadata": {},
+   "source": [
+    "# Train A Self-Driving Lab Policy on H100\n",
+    "\n",
+    "This notebook is designed for Jupyter GPU nodes such as H100 clusters.\n",
+    "It uses the notebook-friendly helpers in `training_script.py` to build prompts from the same self-driving lab environment state used by `run_agent.py`, preview reference actions, and launch GRPO training without shelling out to the CLI."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "da2e770c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install -q -U torch transformers datasets trl accelerate matplotlib huggingface_hub\n",
+    "\n",
+    "# Optional extras used by some reward-scoring paths.\n",
+    "%pip install -q -U sentence-transformers gseapy"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f4444591",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "\n",
+    "import torch\n",
+    "\n",
+    "from training_script import build_prompt_examples, make_training_args, run_training\n",
+    "\n",
+    "print(\"CUDA available:\", torch.cuda.is_available())\n",
+    "if torch.cuda.is_available():\n",
+    "    print(\"GPU:\", torch.cuda.get_device_name(0))\n",
+    "    print(\"bf16 supported:\", torch.cuda.is_bf16_supported())\n",
+    "\n",
+    "Path(\"artifacts\").mkdir(exist_ok=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c9c472b3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "args = make_training_args(\n",
+    "    model_id=\"Qwen/Qwen3.5-0.8B\",\n",
+    "    output_dir=\"artifacts/grpo-h100\",\n",
+    "    dataset_episodes=32,\n",
+    "    rollout_steps=10,\n",
+    "    collection_policy=\"heuristic\",\n",
+    "    reward_backend=\"local\",\n",
+    "    domain_randomise=True,\n",
+    "    num_generations=4,\n",
+    "    max_completion_length=160,\n",
+    "    max_prompt_length=1280,\n",
+    "    per_device_train_batch_size=4,\n",
+    "    gradient_accumulation_steps=4,\n",
+    "    learning_rate=5e-6,\n",
+    "    num_train_epochs=1.0,\n",
+    "    logging_steps=1,\n",
+    "    save_steps=25,\n",
+    "    trust_remote_code=True,\n",
+    "    dry_run=False,\n",
+    "    seed=42,\n",
+    ")\n",
+    "\n",
+    "args"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d4c3d9c4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "preview_examples = build_prompt_examples(\n",
+    "    dataset_episodes=1,\n",
+    "    rollout_steps=args.rollout_steps,\n",
+    "    collection_policy=args.collection_policy,\n",
+    "    scenario_names=[\"cardiac_disease_de\"],\n",
+    "    seed=args.seed,\n",
+    "    domain_randomise=args.domain_randomise,\n",
+    ")\n",
+    "\n",
+    "print(preview_examples[0][\"prompt\"][:3500])\n",
+    "print(\"\\nReference action:\\n\", preview_examples[0][\"reference_action\"])\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "647663dd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional smoke test before a full run.\n",
+    "dry_run_args = make_training_args(**{**vars(args), \"dry_run\": True})\n",
+    "dry_run_result = run_training(dry_run_args)\n",
+    "len(dry_run_result[\"examples\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5f29f456",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from IPython.display import Image, display\n",
+    "\n",
+    "train_result = run_training(args)\n",
+    "for name, plot_path in train_result[\"plot_paths\"].items():\n",
+    "    print(name, plot_path)\n",
+    "    display(Image(filename=plot_path))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
training_script.py CHANGED
@@ -1,4 +1,4 @@
1
- """Train a planner with TRL GRPO and OpenEnv rewards."""
2
 
3
  from __future__ import annotations
4
 
@@ -21,29 +21,53 @@ from models import (
21
  from server.hackathon_environment import BioExperimentEnvironment
22
  from server.tasks.scenarios import SCENARIO_LIBRARY
23
 
24
- DEFAULT_MODEL_ID = "Qwen/Qwen3.5-0.8B"
25
  DEFAULT_OUTPUT_DIR = "training/grpo-output"
26
  DEFAULT_BASE_URL = "http://localhost:8000"
 
27
  INVALID_ACTION_PENALTY = -2.0
28
  ENVIRONMENT_ERROR_PENALTY = -4.0
29
 
30
  SYSTEM_PROMPT = build_agent_system_prompt()
31
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
32
  HEURISTIC_SEQUENCE = [
33
  ActionType.COLLECT_SAMPLE,
 
34
  ActionType.PREPARE_LIBRARY,
35
  ActionType.SEQUENCE_CELLS,
36
  ActionType.RUN_QC,
37
  ActionType.FILTER_DATA,
38
  ActionType.NORMALIZE_DATA,
 
39
  ActionType.CLUSTER_CELLS,
40
  ActionType.DIFFERENTIAL_EXPRESSION,
41
  ActionType.PATHWAY_ENRICHMENT,
42
  ActionType.MARKER_SELECTION,
 
 
43
  ActionType.SYNTHESIZE_CONCLUSION,
44
  ]
45
 
46
- VALID_ACTION_TYPES = {action.value for action in ActionType}
47
 
48
 
49
  def compact_preview(value: Any, max_chars: int = 160) -> str:
@@ -129,7 +153,11 @@ def build_argument_parser() -> argparse.ArgumentParser:
129
  help="Enable domain randomisation while building prompts and local rewards.",
130
  )
131
  parser.add_argument("--num-generations", type=int, default=2)
132
- parser.add_argument("--max-completion-length", type=int, default=220)
 
 
 
 
133
  parser.add_argument("--max-prompt-length", type=int, default=768)
134
  parser.add_argument("--per-device-train-batch-size", type=int, default=2)
135
  parser.add_argument("--gradient-accumulation-steps", type=int, default=1)
@@ -197,23 +225,42 @@ def format_observation(obs: ExperimentObservation) -> str:
197
  if context:
198
  parts.append(context)
199
  if obs.pipeline_history:
200
- parts.append("History:")
201
- for step in obs.pipeline_history[-5:]:
 
202
  tag = "OK" if step.success else "FAIL"
203
- line = f" [{tag}] {step.action_type.value}: {step.output_summary[:100]}"
204
- if step.parameters:
205
- line += f" | params={compact_preview(step.parameters, 120)}"
 
206
  parts.append(line)
 
 
 
 
 
 
 
 
 
 
 
 
 
207
  if obs.latest_output and obs.latest_output.data:
208
  parts.append(
209
- f"Latest output data: {compact_preview(obs.latest_output.data, 200)}"
210
  )
211
  if obs.rule_violations:
212
- parts.append(f"Violations: {obs.rule_violations}")
213
  if obs.discovered_markers:
214
- parts.append(f"Markers: {obs.discovered_markers[:5]}")
215
  if obs.candidate_mechanisms:
216
- parts.append(f"Mechanisms: {obs.candidate_mechanisms[:5]}")
 
 
 
 
217
  return "\n".join(parts)
218
 
219
 
@@ -251,6 +298,7 @@ def default_comparison_name(conditions: Sequence[str]) -> str:
251
  def build_experiment_action(
252
  action_type: ActionType,
253
  discovered_markers: Sequence[str],
 
254
  conditions: Sequence[str],
255
  ) -> ExperimentAction:
256
  method = None
@@ -260,9 +308,27 @@ def build_experiment_action(
260
  if action_type == ActionType.COLLECT_SAMPLE:
261
  parameters = {"n_samples": 6}
262
  justification = "Collect enough samples to start the experiment."
 
 
 
 
 
 
263
  elif action_type == ActionType.PREPARE_LIBRARY:
264
  method = "10x_chromium"
265
  justification = "Prepare a single-cell library for sequencing."
 
 
 
 
 
 
 
 
 
 
 
 
266
  elif action_type == ActionType.SEQUENCE_CELLS:
267
  method = "NovaSeq"
268
  justification = "Generate reads for downstream single-cell analysis."
@@ -275,6 +341,9 @@ def build_experiment_action(
275
  elif action_type == ActionType.NORMALIZE_DATA:
276
  method = "scanpy.pp.normalize_total"
277
  justification = "Normalize counts for comparable expression profiles."
 
 
 
278
  elif action_type == ActionType.CLUSTER_CELLS:
279
  method = "scanpy.tl.leiden"
280
  justification = "Resolve cell states before interpretation."
@@ -291,20 +360,34 @@ def build_experiment_action(
291
  elif action_type == ActionType.MARKER_SELECTION:
292
  method = "scanpy.tl.rank_genes_groups"
293
  justification = "Nominate marker genes for validation."
294
  elif action_type == ActionType.VALIDATE_MARKER:
295
  method = "qPCR"
296
  parameters = {"marker": discovered_markers[0] if discovered_markers else "SPP1"}
297
  justification = "Validate the strongest discovered marker."
298
  elif action_type == ActionType.SYNTHESIZE_CONCLUSION:
299
  top = list(discovered_markers[:5]) if discovered_markers else []
300
  parameters = {
301
  "claims": [{
302
  "top_markers": top,
303
- "causal_mechanisms": [],
304
- "predicted_pathways": {},
305
  "confidence": 0.6,
306
- "claim_type": "correlational",
307
- "claim": "",
308
  }],
309
  }
310
  justification = "Summarize the current evidence into a conclusion."
@@ -375,6 +458,7 @@ def build_prompt_examples(
375
  [action.action_type for action in history_actions],
376
  ),
377
  discovered_markers=obs.discovered_markers,
 
378
  conditions=obs.task.conditions,
379
  )
380
  examples.append({
@@ -543,11 +627,145 @@ def normalize_optional_string(value: Any) -> Optional[str]:
543
  return compact_preview(value, 80)
544
 
545
 
546
  def parse_action_completion(text: str) -> Optional[ExperimentAction]:
547
  payload = extract_json_object(text)
548
  if payload is not None:
549
- action_type = get_payload_value(payload, "action_type")
550
- if action_type not in VALID_ACTION_TYPES:
551
  return None
552
 
553
  parameters = get_payload_value(payload, "parameters", "params") or {}
@@ -584,8 +802,8 @@ def parse_action_completion(text: str) -> Optional[ExperimentAction]:
584
  if not action_match:
585
  return None
586
 
587
- action_type = action_match.group(1).strip()
588
- if action_type not in VALID_ACTION_TYPES:
589
  return None
590
 
591
  method_match = re.search(
@@ -733,6 +951,7 @@ class OpenEnvReward:
733
  obs = env.step(previous_action)
734
  if obs.done:
735
  return float(obs.reward)
 
736
  obs = env.step(action)
737
  return float(obs.reward)
738
 
@@ -1081,7 +1300,7 @@ def generate_action_with_model(
1081
  tokenizer: Any,
1082
  prompt_or_observation: str | ExperimentObservation,
1083
  *,
1084
- max_new_tokens: int = 220,
1085
  temperature: float = 0.2,
1086
  top_p: float = 0.9,
1087
  do_sample: bool = True,
@@ -1114,6 +1333,8 @@ def generate_action_with_model(
1114
  new_tokens = output_ids[0][prompt_tokens:]
1115
  response_text = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
1116
  action = parse_action_completion(response_text)
 
 
1117
  return {
1118
  "prompt": prompt,
1119
  "response_text": response_text,
 
1
+ """Train a self-driving lab planner with TRL GRPO and OpenEnv rewards."""
2
 
3
  from __future__ import annotations
4
 
 
21
  from server.hackathon_environment import BioExperimentEnvironment
22
  from server.tasks.scenarios import SCENARIO_LIBRARY
23
 
24
+ DEFAULT_MODEL_ID = "Qwen/Qwen3.5-4B"
25
  DEFAULT_OUTPUT_DIR = "training/grpo-output"
26
  DEFAULT_BASE_URL = "http://localhost:8000"
27
+ DEFAULT_COMPLETION_TOKEN_BUDGET = 160
28
  INVALID_ACTION_PENALTY = -2.0
29
  ENVIRONMENT_ERROR_PENALTY = -4.0
30
 
31
  SYSTEM_PROMPT = build_agent_system_prompt()
32
 
33
+ ACTION_TYPES = [action.value for action in ActionType]
34
+ ACTION_TYPE_ALIASES = {
35
+ "collect_samples": ActionType.COLLECT_SAMPLE.value,
36
+ "collect_sample_from_bone_marrow": ActionType.COLLECT_SAMPLE.value,
37
+ "collect_samples_from_bone_marrow": ActionType.COLLECT_SAMPLE.value,
38
+ "prepare_sc_library": ActionType.PREPARE_LIBRARY.value,
39
+ "sequence_single_cells": ActionType.SEQUENCE_CELLS.value,
40
+ "qc": ActionType.RUN_QC.value,
41
+ "run_quality_control": ActionType.RUN_QC.value,
42
+ "cluster": ActionType.CLUSTER_CELLS.value,
43
+ "de_analysis": ActionType.DIFFERENTIAL_EXPRESSION.value,
44
+ "differential_expression_analysis": ActionType.DIFFERENTIAL_EXPRESSION.value,
45
+ "trajectory_inference": ActionType.TRAJECTORY_ANALYSIS.value,
46
+ "infer_trajectory": ActionType.TRAJECTORY_ANALYSIS.value,
47
+ "network_inference": ActionType.REGULATORY_NETWORK_INFERENCE.value,
48
+ "select_markers": ActionType.MARKER_SELECTION.value,
49
+ "final_conclusion": ActionType.SYNTHESIZE_CONCLUSION.value,
50
+ }
51
+
52
  HEURISTIC_SEQUENCE = [
53
  ActionType.COLLECT_SAMPLE,
54
+ ActionType.SELECT_COHORT,
55
  ActionType.PREPARE_LIBRARY,
56
  ActionType.SEQUENCE_CELLS,
57
  ActionType.RUN_QC,
58
  ActionType.FILTER_DATA,
59
  ActionType.NORMALIZE_DATA,
60
+ ActionType.INTEGRATE_BATCHES,
61
  ActionType.CLUSTER_CELLS,
62
  ActionType.DIFFERENTIAL_EXPRESSION,
63
  ActionType.PATHWAY_ENRICHMENT,
64
  ActionType.MARKER_SELECTION,
65
+ ActionType.TRAJECTORY_ANALYSIS,
66
+ ActionType.REGULATORY_NETWORK_INFERENCE,
67
  ActionType.SYNTHESIZE_CONCLUSION,
68
  ]
69
 
70
+ VALID_ACTION_TYPES = set(ACTION_TYPES)
71
 
72
 
73
  def compact_preview(value: Any, max_chars: int = 160) -> str:
 
153
  help="Enable domain randomisation while building prompts and local rewards.",
154
  )
155
  parser.add_argument("--num-generations", type=int, default=2)
156
+ parser.add_argument(
157
+ "--max-completion-length",
158
+ type=int,
159
+ default=DEFAULT_COMPLETION_TOKEN_BUDGET,
160
+ )
161
  parser.add_argument("--max-prompt-length", type=int, default=768)
162
  parser.add_argument("--per-device-train-batch-size", type=int, default=2)
163
  parser.add_argument("--gradient-accumulation-steps", type=int, default=1)
 
225
  if context:
226
  parts.append(context)
227
  if obs.pipeline_history:
228
+ last5 = obs.pipeline_history[-5:]
229
+ parts.append("Recent history:")
230
+ for step in last5:
231
  tag = "OK" if step.success else "FAIL"
232
+ line = f" [{tag}] {step.action_type.value}"
233
+ if step.method:
234
+ line += f" ({step.method})"
235
+ line += f": {step.output_summary[:80]}"
236
  parts.append(line)
237
+ completed = {
238
+ step.action_type for step in obs.pipeline_history if step.success
239
+ }
240
+ if completed:
241
+ parts.append(
242
+ "Completed steps (do NOT repeat): "
243
+ + ", ".join(sorted(action.value for action in completed))
244
+ )
245
+ remaining = [
246
+ action.value for action in HEURISTIC_SEQUENCE if action not in completed
247
+ ]
248
+ if remaining:
249
+ parts.append(f"Remaining steps (choose one): {', '.join(remaining)}")
250
  if obs.latest_output and obs.latest_output.data:
251
  parts.append(
252
+ f"Latest data: {compact_preview(obs.latest_output.data, 200)}"
253
  )
254
  if obs.rule_violations:
255
+ parts.append(f"VIOLATIONS: {obs.rule_violations}")
256
  if obs.discovered_markers:
257
+ parts.append(f"Markers found so far: {obs.discovered_markers[:5]}")
258
  if obs.candidate_mechanisms:
259
+ parts.append(f"Candidate mechanisms: {obs.candidate_mechanisms[:5]}")
260
+ parts.append(
261
+ 'Output ONLY a single JSON object with these exact keys, no comments, no extra text:\n'
262
+ '{"action_type": "<one of the remaining steps>", "method": null, "parameters": {}, "justification": "<why>", "confidence": 0.8}'
263
+ )
264
  return "\n".join(parts)
265
 
266
 
 
298
  def build_experiment_action(
299
  action_type: ActionType,
300
  discovered_markers: Sequence[str],
301
+ candidate_mechanisms: Sequence[str],
302
  conditions: Sequence[str],
303
  ) -> ExperimentAction:
304
  method = None
 
308
  if action_type == ActionType.COLLECT_SAMPLE:
309
  parameters = {"n_samples": 6}
310
  justification = "Collect enough samples to start the experiment."
311
+ elif action_type == ActionType.SELECT_COHORT:
312
+ parameters = {
313
+ "comparison": default_comparison_name(conditions),
314
+ "conditions": list(conditions[:2]) or ["disease", "healthy"],
315
+ }
316
+ justification = "Define the cohort split before committing to downstream analysis."
317
  elif action_type == ActionType.PREPARE_LIBRARY:
318
  method = "10x_chromium"
319
  justification = "Prepare a single-cell library for sequencing."
320
+ elif action_type == ActionType.CULTURE_CELLS:
321
+ method = "organoid_culture"
322
+ parameters = {"duration_days": 7}
323
+ justification = "Expand viable cells before a perturbation or profiling step."
324
+ elif action_type == ActionType.PERTURB_GENE:
325
+ method = "CRISPRi"
326
+ parameters = {"target_gene": candidate_mechanisms[0] if candidate_mechanisms else "STAT3"}
327
+ justification = "Test whether a candidate regulator causally shifts cell state."
328
+ elif action_type == ActionType.PERTURB_COMPOUND:
329
+ method = "small_molecule_screen"
330
+ parameters = {"compound": candidate_mechanisms[0] if candidate_mechanisms else "TGFb_inhibitor"}
331
+ justification = "Probe the pathway hypothesis with a targeted compound perturbation."
332
  elif action_type == ActionType.SEQUENCE_CELLS:
333
  method = "NovaSeq"
334
  justification = "Generate reads for downstream single-cell analysis."
 
341
  elif action_type == ActionType.NORMALIZE_DATA:
342
  method = "scanpy.pp.normalize_total"
343
  justification = "Normalize counts for comparable expression profiles."
344
+ elif action_type == ActionType.INTEGRATE_BATCHES:
345
+ method = "scanorama.integrate"
346
+ justification = "Correct batch effects before comparing cellular programs."
347
  elif action_type == ActionType.CLUSTER_CELLS:
348
  method = "scanpy.tl.leiden"
349
  justification = "Resolve cell states before interpretation."
 
360
  elif action_type == ActionType.MARKER_SELECTION:
361
  method = "scanpy.tl.rank_genes_groups"
362
  justification = "Nominate marker genes for validation."
363
+ elif action_type == ActionType.REGULATORY_NETWORK_INFERENCE:
364
+ method = "pySCENIC"
365
+ justification = "Infer upstream regulators behind the observed state changes."
366
  elif action_type == ActionType.VALIDATE_MARKER:
367
  method = "qPCR"
368
  parameters = {"marker": discovered_markers[0] if discovered_markers else "SPP1"}
369
  justification = "Validate the strongest discovered marker."
370
+ elif action_type == ActionType.DESIGN_FOLLOWUP:
371
+ method = "followup_plan"
372
+ parameters = {"priority_hypothesis": candidate_mechanisms[0] if candidate_mechanisms else "fibrotic_activation"}
373
+ justification = "Propose the next experiment to disambiguate remaining uncertainty."
374
+ elif action_type == ActionType.REQUEST_SUBAGENT_REVIEW:
375
+ method = "peer_review"
376
+ parameters = {"focus": "experimental_design"}
377
+ justification = "Request a review of the current self-driving lab plan."
378
  elif action_type == ActionType.SYNTHESIZE_CONCLUSION:
379
  top = list(discovered_markers[:5]) if discovered_markers else []
380
  parameters = {
381
  "claims": [{
382
  "top_markers": top,
383
+ "causal_mechanisms": list(candidate_mechanisms[:5]),
384
+ "predicted_pathways": {
385
+ mechanism: 0.6
386
+ for mechanism in list(candidate_mechanisms[:3])
387
+ },
388
  "confidence": 0.6,
389
+ "claim_type": "causal" if candidate_mechanisms else "correlational",
390
+ "claim": f"Synthesis for {default_comparison_name(conditions)}.",
391
  }],
392
  }
393
  justification = "Summarize the current evidence into a conclusion."
 
458
  [action.action_type for action in history_actions],
459
  ),
460
  discovered_markers=obs.discovered_markers,
461
+ candidate_mechanisms=obs.candidate_mechanisms,
462
  conditions=obs.task.conditions,
463
  )
464
  examples.append({
 
627
  return compact_preview(value, 80)
628
 
629
 
630
+ def normalize_action_type(raw_action_type: Any) -> Optional[str]:
631
+ if not isinstance(raw_action_type, str):
632
+ return None
633
+
634
+ candidate = raw_action_type.strip().lower()
635
+ if candidate in ACTION_TYPES:
636
+ return candidate
637
+ if candidate in ACTION_TYPE_ALIASES:
638
+ return ACTION_TYPE_ALIASES[candidate]
639
+
640
+ candidate = re.sub(r"[^a-z0-9]+", "_", candidate).strip("_")
641
+ if candidate in ACTION_TYPES:
642
+ return candidate
643
+ if candidate in ACTION_TYPE_ALIASES:
644
+ return ACTION_TYPE_ALIASES[candidate]
645
+
646
+ heuristics = [
647
+ (("collect", "sample"), ActionType.COLLECT_SAMPLE.value),
648
+ (("cohort",), ActionType.SELECT_COHORT.value),
649
+ (("library",), ActionType.PREPARE_LIBRARY.value),
650
+ (("culture",), ActionType.CULTURE_CELLS.value),
651
+ (("perturb", "gene"), ActionType.PERTURB_GENE.value),
652
+ (("perturb", "compound"), ActionType.PERTURB_COMPOUND.value),
653
+ (("sequence",), ActionType.SEQUENCE_CELLS.value),
654
+ (("qc",), ActionType.RUN_QC.value),
655
+ (("quality", "control"), ActionType.RUN_QC.value),
656
+ (("filter",), ActionType.FILTER_DATA.value),
657
+ (("normal",), ActionType.NORMALIZE_DATA.value),
658
+ (("integrat", "batch"), ActionType.INTEGRATE_BATCHES.value),
659
+ (("cluster",), ActionType.CLUSTER_CELLS.value),
660
+ (("differential", "expression"), ActionType.DIFFERENTIAL_EXPRESSION.value),
661
+ (("pathway",), ActionType.PATHWAY_ENRICHMENT.value),
662
+ (("trajectory",), ActionType.TRAJECTORY_ANALYSIS.value),
663
+ (("network",), ActionType.REGULATORY_NETWORK_INFERENCE.value),
664
+ (("marker",), ActionType.MARKER_SELECTION.value),
665
+ (("validat", "marker"), ActionType.VALIDATE_MARKER.value),
666
+ (("followup",), ActionType.DESIGN_FOLLOWUP.value),
667
+ (("review",), ActionType.REQUEST_SUBAGENT_REVIEW.value),
668
+ (("conclusion",), ActionType.SYNTHESIZE_CONCLUSION.value),
669
+ ]
670
+ for fragments, normalized in heuristics:
671
+ if all(fragment in candidate for fragment in fragments):
672
+ return normalized
673
+ return None
674
+
675
+
676
+ def _unique_nonempty(items: Sequence[Any], limit: int = 5) -> List[str]:
677
+ seen: set[str] = set()
678
+ result: List[str] = []
679
+ for raw in items:
680
+ value = normalize_optional_string(raw)
681
+ if not value:
682
+ continue
683
+ key = value.upper()
684
+ if key in seen:
685
+ continue
686
+ seen.add(key)
687
+ result.append(value)
688
+ if len(result) >= limit:
689
+ break
690
+ return result
691
+
692
+
693
+ def _infer_conclusion_evidence(
694
+ obs: ExperimentObservation,
695
+ ) -> Tuple[List[str], List[str], Dict[str, float]]:
696
+ top_markers = _unique_nonempty(list(obs.discovered_markers), limit=5)
697
+ causal_mechanisms = _unique_nonempty(list(obs.candidate_mechanisms), limit=5)
698
+ predicted_pathways: Dict[str, float] = {}
699
+
700
+ for output in reversed(obs.all_outputs):
701
+ if not output.success:
702
+ continue
703
+ data = output.data or {}
704
+
705
+ if not top_markers:
706
+ markers = data.get("markers", [])
707
+ if isinstance(markers, list):
708
+ top_markers = _unique_nonempty(markers, limit=5)
709
+ if not causal_mechanisms:
710
+ regulators = data.get("top_regulators", [])
711
+ if isinstance(regulators, list):
712
+ causal_mechanisms = _unique_nonempty(regulators, limit=5)
713
+ if not predicted_pathways:
714
+ for item in data.get("top_pathways", []):
715
+ if not isinstance(item, dict):
716
+ continue
717
+ pathway = normalize_optional_string(item.get("pathway"))
718
+ score = item.get("score")
719
+ if pathway and isinstance(score, (int, float)):
720
+ predicted_pathways[pathway] = float(score)
721
+ if len(predicted_pathways) >= 5:
722
+ break
723
+ if top_markers and causal_mechanisms and predicted_pathways:
724
+ break
725
+
726
+ return top_markers, causal_mechanisms, predicted_pathways
727
+
728
+
729
+ def ensure_conclusion_claims(
730
+ obs: ExperimentObservation,
731
+ action: ExperimentAction,
732
+ ) -> ExperimentAction:
733
+ if action.action_type != ActionType.SYNTHESIZE_CONCLUSION:
734
+ return action
735
+
736
+ parameters = dict(action.parameters or {})
737
+ raw_claims = parameters.get("claims")
738
+ if isinstance(raw_claims, list):
739
+ normalized_claims = [claim for claim in raw_claims if isinstance(claim, dict)]
740
+ if normalized_claims:
741
+ parameters["claims"] = normalized_claims
742
+ if parameters != action.parameters:
743
+ return action.model_copy(update={"parameters": parameters})
744
+ return action
745
+
746
+ top_markers, causal_mechanisms, predicted_pathways = _infer_conclusion_evidence(obs)
747
+ claim_type = "causal" if causal_mechanisms else "correlational"
748
+ conditions = " vs ".join(obs.task.conditions[:2]) if obs.task.conditions else "the task conditions"
749
+ claim = action.justification or f"Final synthesis for {conditions}."
750
+
751
+ parameters["claims"] = [{
752
+ "top_markers": top_markers,
753
+ "causal_mechanisms": causal_mechanisms,
754
+ "predicted_pathways": predicted_pathways,
755
+ "confidence": action.confidence,
756
+ "claim_type": claim_type,
757
+ "claim": claim,
758
+ }]
759
+ if not action.justification:
760
+ action = action.model_copy(update={"justification": claim})
761
+ return action.model_copy(update={"parameters": parameters})
762
+
763
+
764
  def parse_action_completion(text: str) -> Optional[ExperimentAction]:
765
  payload = extract_json_object(text)
766
  if payload is not None:
767
+ action_type = normalize_action_type(get_payload_value(payload, "action_type"))
768
+ if action_type is None:
769
  return None
770
 
771
  parameters = get_payload_value(payload, "parameters", "params") or {}
 
802
  if not action_match:
803
  return None
804
 
805
+ action_type = normalize_action_type(action_match.group(1))
806
+ if action_type is None:
807
  return None
808
 
809
  method_match = re.search(
 
951
  obs = env.step(previous_action)
952
  if obs.done:
953
  return float(obs.reward)
954
+ action = ensure_conclusion_claims(obs, action)
955
  obs = env.step(action)
956
  return float(obs.reward)
957
 
 
1300
  tokenizer: Any,
1301
  prompt_or_observation: str | ExperimentObservation,
1302
  *,
1303
+ max_new_tokens: int = DEFAULT_COMPLETION_TOKEN_BUDGET,
1304
  temperature: float = 0.2,
1305
  top_p: float = 0.9,
1306
  do_sample: bool = True,
 
1333
  new_tokens = output_ids[0][prompt_tokens:]
1334
  response_text = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
1335
  action = parse_action_completion(response_text)
1336
+ if action is not None and isinstance(prompt_or_observation, ExperimentObservation):
1337
+ action = ensure_conclusion_claims(prompt_or_observation, action)
1338
  return {
1339
  "prompt": prompt,
1340
  "response_text": response_text,
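The alias-and-heuristic normalization added to `training_script.py` above can be exercised standalone. The sketch below is illustrative only: `CANONICAL` and `ALIASES` are tiny stand-in tables, not the full `ActionType` sets from the script, and `normalize` mirrors only the lookup-then-sanitize idea of `normalize_action_type`.

```python
import re

# Hypothetical stand-in tables (the real script derives these from ActionType).
CANONICAL = {"collect_sample", "run_qc", "cluster_cells"}
ALIASES = {"collect_samples": "collect_sample", "qc": "run_qc", "cluster": "cluster_cells"}

def normalize(raw):
    """Return a canonical action name, trying exact/alias lookup first,
    then a punctuation-insensitive retry, mirroring normalize_action_type."""
    if not isinstance(raw, str):
        return None
    candidate = raw.strip().lower()
    for name in (candidate, re.sub(r"[^a-z0-9]+", "_", candidate).strip("_")):
        if name in CANONICAL:
            return name
        if name in ALIASES:
            return ALIASES[name]
    return None

print(normalize("Collect Samples"))  # -> collect_sample
```

The second pass (regex sanitization) is what lets free-form model output like `"Run QC!"` still resolve to a valid action type.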
training_unsloth.py ADDED
@@ -0,0 +1,439 @@
1
+ """Train and run quantized self-driving lab models with Unsloth.
2
+
3
+ This keeps the same OpenEnv prompt + reward wiring as `training_script.py`,
4
+ but arranges the Unsloth path in the more typical pattern:
5
+ 1. patch GRPO support
6
+ 2. load a quantized model
7
+ 3. apply LoRA adapters
8
+ 4. train with an explicit OpenEnv reward function
9
+ """
10
+
11
+ from __future__ import annotations
12
+
13
+ import argparse
14
+ import random
15
+ from pathlib import Path
16
+ from typing import Any, Dict, Optional, Sequence
17
+
18
+ import training_script as base
19
+
20
+ DEFAULT_OUTPUT_DIR = "training/grpo-unsloth-output"
21
+ DEFAULT_MAX_SEQ_LENGTH = 2048
22
+ DEFAULT_LORA_R = 16
23
+ DEFAULT_LORA_ALPHA = 16
24
+ DEFAULT_LORA_DROPOUT = 0.0
25
+ LORA_TARGET_MODULES = [
26
+ "q_proj",
27
+ "k_proj",
28
+ "v_proj",
29
+ "o_proj",
30
+ "gate_proj",
31
+ "up_proj",
32
+ "down_proj",
33
+ ]
34
+
35
+
36
+ def require_unsloth():
37
+ try:
38
+ from unsloth import FastLanguageModel, PatchFastRL
39
+ except ImportError as exc: # pragma: no cover - depends on optional extra
40
+ raise RuntimeError(
41
+ "Unsloth is not installed. Run `uv sync --extra train` "
42
+ "to install the H100/quantized training dependencies."
43
+ ) from exc
44
+ return FastLanguageModel, PatchFastRL
45
+
46
+
47
+ def _call_unsloth_from_pretrained(FastLanguageModel, **kwargs: Any):
48
+ for optional_key in ("fast_inference", "trust_remote_code"):
49
+ try:
50
+ return FastLanguageModel.from_pretrained(**kwargs)
51
+ except TypeError as exc:
52
+ if optional_key in kwargs and optional_key in str(exc):
53
+ kwargs = dict(kwargs)
54
+ kwargs.pop(optional_key, None)
55
+ continue
56
+ raise
57
+ return FastLanguageModel.from_pretrained(**kwargs)
58
+
59
+
60
+ def build_argument_parser() -> argparse.ArgumentParser:
61
+ parser = base.build_argument_parser()
62
+ parser.description = (
63
+ "Train a GRPO policy with Unsloth quantized loading for faster H100 runs."
64
+ )
65
+ parser.set_defaults(output_dir=DEFAULT_OUTPUT_DIR)
66
+ parser.add_argument(
67
+ "--max-seq-length",
68
+ type=int,
69
+ default=DEFAULT_MAX_SEQ_LENGTH,
70
+ help="Context length passed to Unsloth model loading.",
71
+ )
72
+ parser.add_argument(
73
+ "--disable-4bit",
74
+ action="store_true",
75
+ help="Disable 4-bit quantized loading and load the full-precision base weights instead.",
76
+ )
77
+ parser.add_argument(
78
+ "--disable-fast-inference",
79
+ action="store_true",
80
+ help="Disable Unsloth fast inference kernels where supported.",
81
+ )
82
+ parser.add_argument(
83
+ "--lora-r",
84
+ type=int,
85
+ default=DEFAULT_LORA_R,
86
+ help="LoRA rank used for the quantized GRPO policy.",
87
+ )
88
+ parser.add_argument(
89
+ "--lora-alpha",
90
+ type=int,
91
+ default=DEFAULT_LORA_ALPHA,
92
+ help="LoRA alpha used for the quantized GRPO policy.",
93
+ )
94
+ parser.add_argument(
95
+ "--lora-dropout",
96
+ type=float,
97
+ default=DEFAULT_LORA_DROPOUT,
98
+ help="LoRA dropout used for the quantized GRPO policy.",
99
+ )
100
+ parser.add_argument(
101
+ "--save-merged-16bit",
102
+ action="store_true",
103
+ help="Also export a merged 16-bit model after training if supported.",
104
+ )
105
+ return parser
106
+
107
+
108
+ def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
109
+ return build_argument_parser().parse_args(argv)
110
+
111
+
112
+ def make_training_args(**overrides: Any) -> argparse.Namespace:
113
+ parser = build_argument_parser()
114
+ defaults = vars(parser.parse_args([]))
115
+ unknown = sorted(set(overrides) - set(defaults))
116
+ if unknown:
117
+ raise ValueError(f"Unknown training args: {', '.join(unknown)}")
118
+ defaults.update(overrides)
119
+ return argparse.Namespace(**defaults)
120
+
121
+
122
+ def load_model_artifacts(
123
+ model_id: str,
124
+ *,
125
+ trust_remote_code: bool,
126
+ max_seq_length: int = DEFAULT_MAX_SEQ_LENGTH,
127
+ load_in_4bit: bool = True,
128
+ fast_inference: bool = True,
129
+ prepare_for_inference: bool = False,
130
+ ):
131
+ FastLanguageModel, _ = require_unsloth()
132
+ runtime = base.resolve_torch_runtime()
133
+
134
+ print(f"Loading Unsloth tokenizer+model for {model_id} ...")
135
+ model, tokenizer = _call_unsloth_from_pretrained(
136
+ FastLanguageModel,
137
+ model_name=model_id,
138
+ max_seq_length=max_seq_length,
139
+ dtype="auto",
140
+ load_in_4bit=load_in_4bit,
141
+ fast_inference=fast_inference,
142
+ trust_remote_code=trust_remote_code,
143
+ )
144
+ if tokenizer.pad_token is None and tokenizer.eos_token is not None:
145
+ tokenizer.pad_token = tokenizer.eos_token
146
+
147
+ if prepare_for_inference:
148
+ try:
149
+ FastLanguageModel.for_inference(model)
150
+ except AttributeError:
151
+ pass
152
+
153
+ device = getattr(model, "device", None)
154
+ if device is None:
155
+ try:
156
+ device = next(model.parameters()).device
157
+ except StopIteration:
158
+ device = runtime["device"]
159
+ print(f"Loaded model on device: {device}")
160
+ return tokenizer, model
161
+
162
+
163
+ def build_openenv_reward(args: argparse.Namespace) -> base.OpenEnvReward:
164
+ """Return the OpenEnv-compatible reward callable used by GRPO."""
165
+ return base.OpenEnvReward(
166
+ reward_backend=args.reward_backend,
167
+ base_url=args.base_url,
168
+ domain_randomise=args.domain_randomise,
169
+ )
170
+
171
+
172
+ def prepare_prompt_examples(args: argparse.Namespace) -> Dict[str, Any]:
173
+ """Build the OpenEnv rollout states that seed GRPO prompts."""
174
+ scenario_names = base.selected_scenarios(args.scenario_name)
175
+ examples = base.build_prompt_examples(
176
+ dataset_episodes=args.dataset_episodes,
177
+ rollout_steps=args.rollout_steps,
178
+ collection_policy=args.collection_policy,
179
+ scenario_names=scenario_names,
180
+ seed=args.seed,
181
+ domain_randomise=args.domain_randomise,
182
+ )
183
+ return {
184
+ "scenario_names": scenario_names,
185
+ "examples": examples,
186
+ }
187
+
188
+
189
+ def patch_unsloth_grpo():
190
+ """Patch TRL GRPO to use Unsloth's optimized kernels."""
191
+ FastLanguageModel, PatchFastRL = require_unsloth()
192
+ PatchFastRL("GRPO", FastLanguageModel)
193
+ return FastLanguageModel
194
+
195
+
196
+ def apply_lora_adapters(FastLanguageModel, model: Any, args: argparse.Namespace) -> Any:
197
+ """Apply LoRA adapters in the usual Unsloth configuration style."""
198
+ return FastLanguageModel.get_peft_model(
199
+ model,
200
+ r=args.lora_r,
201
+ target_modules=LORA_TARGET_MODULES,
202
+ lora_alpha=args.lora_alpha,
203
+ lora_dropout=args.lora_dropout,
204
+ bias="none",
205
+ use_gradient_checkpointing=True,
206
+ random_state=args.seed,
207
+ )
208
+
209
+
210
+ def build_grpo_config(
211
+ args: argparse.Namespace,
212
+ runtime: Dict[str, Any],
213
+ ):
214
+ from trl import GRPOConfig
215
+
216
+ return GRPOConfig(
217
+ output_dir=args.output_dir,
218
+ learning_rate=args.learning_rate,
219
+ per_device_train_batch_size=args.per_device_train_batch_size,
220
+ gradient_accumulation_steps=args.gradient_accumulation_steps,
221
+ num_generations=args.num_generations,
222
+ max_completion_length=args.max_completion_length,
223
+ num_train_epochs=args.num_train_epochs,
224
+ logging_steps=args.logging_steps,
225
+ save_steps=args.save_steps,
226
+ bf16=runtime["bf16"],
227
+ fp16=runtime["fp16"],
228
+ report_to="none",
229
+ remove_unused_columns=False,
230
+ )
231
+
232
+
233
+ def build_unsloth_grpo_trainer(
234
+ *,
235
+ model: Any,
236
+ tokenizer: Any,
237
+ reward_func: Any,
238
+ train_dataset: Any,
239
+ args: argparse.Namespace,
240
+ runtime: Dict[str, Any],
241
+ ):
242
+ from trl import GRPOTrainer
243
+
244
+ config = build_grpo_config(args, runtime)
245
+ return GRPOTrainer(
246
+ model=model,
247
+ reward_funcs=reward_func,
248
+ args=config,
249
+ train_dataset=train_dataset,
250
+ processing_class=tokenizer,
251
+ )
252
+
253
+
254
+ def generate_action_with_model(
255
+ model: Any,
256
+ tokenizer: Any,
257
+ prompt_or_observation: str | base.ExperimentObservation,
258
+ *,
259
+ max_new_tokens: int = base.DEFAULT_COMPLETION_TOKEN_BUDGET,
260
+ temperature: float = 0.2,
261
+ top_p: float = 0.9,
262
+ do_sample: bool = True,
263
+ ) -> Dict[str, Any]:
264
+ import torch
265
+
266
+ if isinstance(prompt_or_observation, base.ExperimentObservation):
267
+ prompt = base.build_training_prompt(prompt_or_observation)
268
+ else:
269
+ prompt = str(prompt_or_observation)
270
+
271
+ model_device = getattr(model, "device", None)
272
+ if model_device is None:
273
+ try:
274
+ model_device = next(model.parameters()).device
275
+ except StopIteration:
276
+ model_device = base.resolve_torch_runtime()["device"]
277
+
278
+ inputs = tokenizer(prompt, return_tensors="pt")
279
+ inputs = {key: value.to(model_device) for key, value in inputs.items()}
280
+ prompt_tokens = inputs["input_ids"].shape[1]
281
+
282
+ generation_kwargs = {
283
+ "max_new_tokens": max_new_tokens,
284
+ "do_sample": do_sample,
285
+ "temperature": temperature,
286
+ "top_p": top_p,
287
+ "pad_token_id": tokenizer.pad_token_id,
288
+ }
289
+ with torch.no_grad():
290
+ output_ids = model.generate(**inputs, **generation_kwargs)
291
+
292
+ new_tokens = output_ids[0][prompt_tokens:]
293
+ response_text = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
294
+ action = base.parse_action_completion(response_text)
295
+ if action is not None and isinstance(prompt_or_observation, base.ExperimentObservation):
296
+ action = base.ensure_conclusion_claims(prompt_or_observation, action)
297
+ return {
298
+ "prompt": prompt,
299
+ "response_text": response_text,
300
+ "action": action,
301
+ }
302
+
303
+
304
+ def run_training(args: argparse.Namespace) -> Dict[str, Any]:
305
+ random.seed(args.seed)
306
+ runtime = base.resolve_torch_runtime()
307
+
308
+ if args.load_model_only:
309
+ tokenizer, model = load_model_artifacts(
310
+ args.model_id,
311
+ trust_remote_code=args.trust_remote_code,
312
+ max_seq_length=args.max_seq_length,
313
+ load_in_4bit=not args.disable_4bit,
314
+ fast_inference=not args.disable_fast_inference,
315
+ prepare_for_inference=True,
316
+ )
317
+ device = getattr(model, "device", "unknown")
318
+ print(f"Unsloth model ready: {args.model_id}")
319
+ print(f"Tokenizer vocab size: {len(tokenizer)}")
320
+ print(f"Model device: {device}")
321
+ print(f"Runtime device name: {runtime['device_name']}")
322
+ return {
323
+ "args": args,
324
+ "runtime": runtime,
325
+ "tokenizer": tokenizer,
326
+ "model": model,
327
+ }
328
+
329
+ prompt_data = prepare_prompt_examples(args)
330
+ scenario_names = prompt_data["scenario_names"]
331
+ examples = prompt_data["examples"]
332
+ env_reward = build_openenv_reward(args)
333
+
334
+ if args.dry_run:
335
+ base.run_dry_run_preview(examples, env_reward, args.output_dir)
336
+ return {
337
+ "args": args,
338
+ "runtime": runtime,
339
+ "scenario_names": scenario_names,
340
+ "examples": examples,
341
+ "reward_fn": env_reward,
342
+ }
343
+
344
+ from datasets import Dataset
345
+
346
+ FastLanguageModel = patch_unsloth_grpo()
347
+ train_dataset = Dataset.from_list(examples)
348
+
349
+ # 1. Load model with Unsloth quantized loading.
350
+ tokenizer, model = load_model_artifacts(
351
+ args.model_id,
352
+ trust_remote_code=args.trust_remote_code,
353
+ max_seq_length=args.max_seq_length,
354
+ load_in_4bit=not args.disable_4bit,
355
+ fast_inference=not args.disable_fast_inference,
356
+ )
357
+ # 2. Apply LoRA adapters.
358
+ model = apply_lora_adapters(FastLanguageModel, model, args)
359
+
360
+ print(
361
+ f"Unsloth training runtime: device={runtime['device']} "
362
+ f"name={runtime['device_name']} "
363
+ f"dtype={runtime['dtype']} "
364
+ f"load_in_4bit={not args.disable_4bit}"
365
+ )
366
+ print(
367
+ "OpenEnv reward: "
368
+ f"backend={args.reward_backend} scenarios={len(scenario_names)} "
369
+ f"examples={len(examples)}"
370
+ )
371
+
372
+ # 3. Train with GRPO against the OpenEnv reward function.
373
+ trainer = build_unsloth_grpo_trainer(
374
+ model=model,
375
+ tokenizer=tokenizer,
376
+ reward_func=env_reward,
377
+ train_dataset=train_dataset,
378
+ args=args,
379
+ runtime=runtime,
380
+ )
381
+ trainer.train()
382
+ trainer.save_model(args.output_dir)
383
+ tokenizer.save_pretrained(args.output_dir)
384
+
385
+ if args.save_merged_16bit:
386
+ merged_dir = Path(args.output_dir) / "merged_16bit"
387
+ try:
388
+ model.save_pretrained_merged(
389
+ str(merged_dir),
390
+ tokenizer,
391
+ save_method="merged_16bit",
392
+ )
393
+ print(f"Saved merged 16-bit model to {merged_dir}")
394
+ except AttributeError:
395
+ print("Merged 16-bit export is not available in this Unsloth build; skipping.")
396
+
397
+ if args.push_to_hub:
398
+ from huggingface_hub import HfApi
399
+
400
+ api = HfApi()
401
+ api.create_repo(repo_id=args.push_to_hub, repo_type="model", exist_ok=True)
402
+ print(f"Pushing model to HuggingFace Hub: {args.push_to_hub}")
403
+ api.upload_folder(
404
+ folder_path=args.output_dir,
405
+ repo_id=args.push_to_hub,
406
+ repo_type="model",
407
+ create_pr=False,
408
+ )
409
+ print(f"Model pushed to https://huggingface.co/{args.push_to_hub}")
410
+
411
+ plot_paths = base.save_training_plots(
412
+ trainer.state.log_history,
413
+ args.output_dir,
414
+ metric_key=args.plot_metric_key,
415
+ )
416
+ print("Saved training plots:")
417
+ for plot_name, plot_path in plot_paths.items():
418
+ print(f" - {plot_name}: {plot_path}")
419
+
420
+ return {
421
+ "args": args,
422
+ "runtime": runtime,
423
+ "scenario_names": scenario_names,
424
+ "examples": examples,
425
+ "reward_fn": env_reward,
426
+ "train_dataset": train_dataset,
427
+ "tokenizer": tokenizer,
428
+ "model": model,
429
+ "trainer": trainer,
430
+ "plot_paths": plot_paths,
431
+ }
432
+
433
+
434
+ def main() -> None:
435
+ run_training(parse_args())
436
+
437
+
438
+ if __name__ == "__main__":
439
+ main()
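The `_call_unsloth_from_pretrained` helper in `training_unsloth.py` retries model loading after dropping optional keyword arguments that older Unsloth builds reject. That fallback pattern can be sketched standalone; here `fake_from_pretrained` is a hypothetical stand-in for `FastLanguageModel.from_pretrained`, used only to demonstrate the retry behaviour.

```python
def fake_from_pretrained(**kwargs):
    # Stand-in loader: pretend this library version does not accept fast_inference.
    if "fast_inference" in kwargs:
        raise TypeError("unexpected keyword argument 'fast_inference'")
    return ("model", kwargs)

def call_with_fallback(loader, **kwargs):
    """Retry the loader, dropping an optional kwarg when a TypeError names it."""
    for optional_key in ("fast_inference", "trust_remote_code"):
        try:
            return loader(**kwargs)
        except TypeError as exc:
            if optional_key in kwargs and optional_key in str(exc):
                kwargs = dict(kwargs)
                kwargs.pop(optional_key, None)
                continue
            raise
    return loader(**kwargs)

model, used = call_with_fallback(fake_from_pretrained, model_name="m", fast_inference=True)
print(sorted(used))  # -> ['model_name']
```

Note the fallback only forgives the key currently under consideration; a TypeError naming a different kwarg is re-raised, which matches the behaviour of the helper in the diff.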
uv.lock CHANGED
The diff for this file is too large to render. See raw diff