Add AGT 04/05/07 implementations, server integration, and doc updates
- AGT 04: build_baseline_scientist_action with domain inference
- AGT 05: check_feasibility with per-dimension checks
- AGT 07: compose_lab_manager_response with action type selection
- Server: real Lab Manager pipeline replaces hardcoded stub responses
- Tests: AGT 04/05/07 test coverage
- Docs: task completion tracking and model selection notes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
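
The per-dimension checks (AGT 05) and action type selection (AGT 07) named in the commit message could look roughly like the sketch below. This is not the repository's code: the function names `check_dimensions` and `select_action_type`, the `plan` and `limits` dictionary layouts, and the `"accept"` and `"reject"` strings are assumptions made for illustration. Only the five dimension names and `suggest_alternative` appear in the task notes.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionCheck:
    """Pass/fail result for one constraint dimension, with a readable reason."""
    ok: bool
    reason: str = ""


def check_dimensions(plan: dict, limits: dict) -> dict[str, DimensionCheck]:
    """Return one result per dimension (hypothetical plan/limits layout)."""
    missing_equipment = set(plan["equipment"]) - set(limits["equipment"])
    missing_reagents = set(plan["reagents"]) - set(limits["reagents"])
    return {
        "budget": DimensionCheck(
            plan["cost"] <= limits["budget"],
            f"cost {plan['cost']} vs budget {limits['budget']}",
        ),
        "equipment": DimensionCheck(
            not missing_equipment, f"unavailable: {sorted(missing_equipment)}"
        ),
        "reagent": DimensionCheck(
            not missing_reagents, f"unavailable: {sorted(missing_reagents)}"
        ),
        "schedule": DimensionCheck(
            plan["days"] <= limits["max_days"],
            f"{plan['days']} days vs limit {limits['max_days']}",
        ),
        "staff": DimensionCheck(
            plan["staff"] <= limits["staff"],
            f"{plan['staff']} staff vs {limits['staff']} available",
        ),
    }


def select_action_type(
    results: dict[str, DimensionCheck], has_alternative: bool
) -> str:
    """Pick a response type from the checks; "accept"/"reject" are assumed names."""
    if all(check.ok for check in results.values()):
        return "accept"
    return "suggest_alternative" if has_alternative else "reject"
```

Keeping the checker free of randomness and model calls means the same proposal and the same limits always produce the same per-dimension results, which is the property the AGT 09 acceptance criteria ask for.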
- README.md +6 -0
- ReplicaLab_Comprehensive_Task_Division.md +52 -27
- docs/agt11_scientist_model_selection.md +45 -0
- docs/ayush/task_breakdown.md +118 -127
- docs/ayush/task_list.md +26 -18
- docs/changes.md +6 -0
- docs/completion.md +94 -28
- docs/kian/task_breakdown.md +15 -19
- docs/kian/task_list.md +29 -16
- docs/kush/task_breakdown.md +4 -4
- docs/kush/task_list.md +3 -2
- docs/max/task_breakdown.md +8 -11
- docs/max/task_list.md +5 -3
- replicalab/agents/__init__.py +2 -0
- replicalab/agents/lab_manager_policy.py +180 -2
- replicalab/agents/scientist_policy.py +180 -0
- server/app.py +44 -11
- tests/test_lab_manager_policy.py +93 -1
- tests/test_scientist_policy.py +100 -0
README.md
CHANGED
@@ -131,6 +131,12 @@ pytest tests/
 
 RL training improves the Scientist agent's ability to negotiate effective, feasible plans.
 
+### Selected base model
+
+- **Primary Scientist model:** `Qwen3-4B`
+- **Stretch fallback:** `Qwen3-8B`
+- **Decision record:** `docs/agt11_scientist_model_selection.md`
+
 ### Quick Start (Google Colab)
 
 1. Open `notebooks/train_colab.ipynb` in Google Colab
ReplicaLab_Comprehensive_Task_Division.md
CHANGED
|
@@ -295,12 +295,19 @@ Create a stable shared codebase, contracts, and development workflow so all work
|
|
| 295 |
- Completed scope for `FND 09`: added `openenv.yaml` with OpenEnv manifest metadata plus the minimal repo wiring required for local OpenEnv validation (`openenv-core` dependency, `server` script entry point, `uv.lock`, and `server.app.main()`)
|
| 296 |
- Completed scope for `FND 10`: created `replicalab/outputs/` with tracked `logs/`, `replays/`, and `plots/` subdirectories
|
| 297 |
- Completed scope for `FND 11`: added `server/requirements.txt` with standalone runtime dependency pins and verified installation from that file
|
| 298 |
- Partial backend scope imported from Max's PR: `server/app.py`, `server/Dockerfile`, and `docs/max/deployment.md` were normalized onto the current standards and validated locally against the stub env
|
| 299 |
-
- Remaining work now unblocked by `FND 01`: `FND 03`
|
| 300 |
- Newly unblocked by `FND 08`: `MOD 01`, `MOD 02`, `MOD 03`, `MOD 12`, `SCN 01`
|
| 301 |
- Newly unblocked by `FND 06`: `DOC 01`
|
| 302 |
-
-
|
|
|
|
| 303 |
- Remaining completion items for the imported backend scaffold: real-env integration, Docker validation, and final deployment verification
|
| 304 |
|
| 305 |
### User stories
|
| 306 |
|
|
@@ -316,7 +323,7 @@ As a team, we want agreed schemas and coding rules so integration risk stays low
|
|
| 316 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
| 317 |
| FND 01 | E01.1 | Person C | repo root | Create repo structure and base folders from agreed layout | none | 0.5h | all top level folders exist and repo clones cleanly | ✅ Completed | Person B (Ayush) |
|
| 318 |
| FND 02 | E01.1 | Person C | `pyproject.toml` | Add Python project config and dependencies placeholder | FND 01 | 0.5h | project installs locally without missing package errors for base modules | ✅ Completed | Person B (Ayush) |
|
| 319 |
-
| FND 03 | E01.1 | Person C | `frontend/package.json` | Initialize React plus Vite frontend shell | FND 01 | 0.5h | `npm install` and dev server run successfully |
|
| 320 |
| FND 04 | E01.2 | Person A | `replicalab/models.py` | Add empty Pydantic models and shared type names | FND 01 | 0.5h | import paths resolve for all placeholder models | ✅ Completed | Person B (Ayush) |
|
| 321 |
| FND 05 | E01.2 | Person C | `.gitignore` and `.dockerignore` | Add ignore rules for Python, Node, logs, notebooks, and build artifacts. `.dockerignore` must explicitly exclude `.git`, `node_modules`, `notebooks/`, `tests/`, `__pycache__`, `.venv`, and output files to keep the Docker image lean | FND 01 | 0.25h | repo status stays clean after local run and build, and Docker build excludes non-runtime files | ✅ Completed | Person B (Ayush) |
|
| 322 |
| FND 06 | E01.2 | Person D | `README.md` | Add temporary project stub with title, mission, team roles, and local setup placeholder | FND 01 | 0.5h | new contributor can understand repo purpose in under two minutes | ✅ Completed | Person B (Ayush) |
|
|
@@ -325,7 +332,7 @@ As a team, we want agreed schemas and coding rules so integration risk stays low
|
|
| 325 |
| FND 09 | E01.2 | Person A | `openenv.yaml` | Create OpenEnv configuration file specifying environment class, action and observation types, and server settings | FND 04 | 0.5h | OpenEnv can discover and serve the environment using this config file | ✅ Completed | Person B (Ayush) |
|
| 326 |
| FND 10 | E01.1 | Person C | `replicalab/outputs/` | Create output directory structure with `logs/`, `replays/`, and `plots/` subdirectories and add to gitignore | FND 01 | 0.25h | output directories exist and generated files are not committed to git | ✅ Completed | Person B (Ayush) |
|
| 327 |
| FND 11 | E01.1 | Person C | `server/requirements.txt` | Create server requirements file pinning FastAPI, uvicorn, websockets, and other runtime dependencies | FND 02 | 0.25h | server can be installed from requirements.txt independently of pyproject.toml | ✅ Completed | Max (Person C) |
|
| 328 |
-
| FND 12 | E01.1 | Person C | `frontend/vite.config.ts` | Create Vite config with API and WebSocket proxy support for local development plus stable build output settings | FND 03 | 0.5h | frontend dev server can reach backend without manual URL edits and build output is predictable for Docker packaging |
|
| 329 |
| FND 13 | E01.1 | Person D | `frontend/tailwind.config.ts` and `frontend/postcss.config.js` | Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles | FND 03 | 0.75h | frontend can use Tailwind utilities and shared shadcn compatible theme tokens without CSS pipeline errors | ⬜ Not started | — |
|
| 330 |
|
| 331 |
---
|
|
@@ -343,11 +350,29 @@ Define the environment contracts cleanly so state, actions, and observations are
|
|
| 343 |
- `MOD 02` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
|
| 344 |
- `MOD 03` status: completed on 2026-03-08
|
| 345 |
- `MOD 03` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
|
| 346 |
- Completed scope for `MOD 01`: replaced the placeholder `ScientistAction` with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, rejected mixed-mode payloads, added conditional validation for proposal, revision, request-info, and accept modes, added focused schema tests, and patched the stub server so `accept` no longer overwrites the current protocol with default values
|
| 347 |
- Completed scope for `MOD 02`: replaced the placeholder `LabManagerAction` with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, enforced feasible-flag consistency across budget, equipment, reagent, schedule, and staff checks, rejected suggestion fields outside `suggest_alternative`, and added focused validation tests
|
| 348 |
- Completed scope for `MOD 03`: introduced typed `ConversationEntry` and `Protocol` models, upgraded both observation branches to use typed nested structures with non-negative numeric constraints and stable keys, and verified dict-to-model coercion through the current stub server and focused tests
|
| 349 |
- Newly unblocked by `MOD 01`: `MOD 05`, `MOD 09`
|
| 350 |
- Newly unblocked by `MOD 03`: `MOD 04`, `MOD 11`
|
| 351 |
|
| 352 |
### User stories
|
| 353 |
|
|
@@ -364,15 +389,15 @@ As the training loop, I need deterministic state serialization so episodes can b
|
|
| 364 |
| MOD 01 | E02.1 | Person A | `replicalab/models.py` | Implement `ScientistAction` schema | FND 08 | 0.5h | valid scientist actions parse and invalid fields raise validation errors | ✅ Completed | Person B (Ayush) |
|
| 365 |
| MOD 02 | E02.1 | Person A | `replicalab/models.py` | Implement `LabManagerAction` schema | FND 08 | 0.5h | valid lab manager actions parse and invalid fields raise validation errors | ✅ Completed | Person B (Ayush) |
|
| 366 |
| MOD 03 | E02.1 | Person A | `replicalab/models.py` | Implement role specific `Observation` models | FND 08 | 0.75h | scientist and lab observations serialize to JSON with stable keys | ✅ Completed | Person B (Ayush) |
|
| 367 |
-
| MOD 04 | E02.2 | Person A | `replicalab/models.py` | Implement `EpisodeState` and `EpisodeLog` models | MOD 03 | 0.75h | full state round trip serialize plus deserialize works |
|
| 368 |
-
| MOD 05 | E02.1 | Person A | `replicalab/utils/validation.py` | Add protocol validation for sample size, controls, duration, equipment vocab, reagent vocab | MOD 01 | 1h | invalid protocol examples are rejected with readable reasons |
|
| 369 |
| MOD 06 | E02.1 | Person A | `replicalab/utils/validation.py` | Add semantic validators for impossible plans such as zero sample size with positive controls | MOD 05 | 0.75h | semantic validator catches at least five invalid edge cases | ⬜ Not started | — |
|
| 370 |
| MOD 07 | E02.2 | Person C | `replicalab/utils/logging.py` | Add state serialization helper for replay logs | MOD 04 | 0.5h | state logs can be written and loaded without loss | ⬜ Not started | — |
|
| 371 |
| MOD 08 | E02.2 | Person A | tests | Write unit tests for schemas and validators | MOD 01 to MOD 07 | 1h | tests cover valid parse, invalid parse, and replay serialization | ⬜ Not started | — |
|
| 372 |
-
| MOD 09 | E02.2 | Person B | `replicalab/agents/scientist_policy.py` | Add output parser that maps model text to `ScientistAction` | MOD 01 | 0.75h | parser returns structured action or explicit parse error |
|
| 373 |
| MOD 10 | E02.2 | Person C | API docs | Publish schema examples for frontend and notebook clients | MOD 01 to MOD 04 | 0.5h | frontend and notebook can mock against shared sample payloads | ⬜ Not started | — |
|
| 374 |
-
| MOD 11 | E02.1 | Person A | `replicalab/models.py` | Implement `StepResult` model with observation, reward, done flag, and info dict | MOD 03 | 0.5h | step result serializes cleanly and all consumers agree on its shape |
|
| 375 |
-
| MOD 12 | E02.2 | Person A | `replicalab/config.py` | Create environment configuration module with constants for max rounds, default difficulty, timeout duration, max budget, and round time limit | FND 08 | 0.5h | all modules import config from one place and no magic numbers remain in env or scoring code |
|
| 376 |
|
| 377 |
---
|
| 378 |
|
|
@@ -393,17 +418,17 @@ As a judge, I want normalized constraints and resources so the environment tests
|
|
| 393 |
|
| 394 |
| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
|
| 395 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
| 396 |
-
| SCN 01 | E03.1 | Person A | `replicalab/utils/seed.py` | Implement deterministic RNG helper `seed_rng()` in dedicated seed utility module | FND 08 | 0.5h | same seed always yields the same random choices and seed module is importable from scenarios and env |
|
| 397 |
-
| SCN 02 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Define normalized scenario schema with task summary, success criteria, constraints, resources, allowed substitutions, and hidden reference spec | MOD 04 | 0.75h | all scenario builders return the same normalized top level structure and mapper-ready inputs |
|
| 398 |
-
| SCN 03 | E03.2 | Person A | `replicalab/scenarios/math_reasoning.py` | Implement mathematics template with theorem, proof-goal, tool, time, and review constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests |
|
| 399 |
-
| SCN 04 | E03.2 | Person A | `replicalab/scenarios/ml_benchmark.py` | Implement ML benchmark template with dataset, compute, time, and evaluation constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests |
|
| 400 |
-
| SCN 05 | E03.2 | Person A | `replicalab/scenarios/finance_trading.py` | Implement finance and trading planning template with risk, capital, slippage, and backtest constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests |
|
| 401 |
-
| SCN 06 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement difficulty application for easy, medium, hard by mechanically altering constraints, resources, and conflicts | SCN 03 to SCN 05 | 1h | difficulty visibly changes the normalized scenario pack in a meaningful way |
|
| 402 |
-
| SCN 07 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement normalized constraint and resource generator for budget, time, compute, personnel, stock, and bookings | SCN 02 | 1.25h | no generated scenario contains contradictory constraints or resources |
|
| 403 |
-
| SCN 08 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement hidden reference spec and allowed substitutions per template | SCN 03 to SCN 05 | 1h | hidden reference clearly marks what is fixed versus flexible for deterministic scoring |
|
| 404 |
-
| SCN 09 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement `generate_scenario(seed, template, difficulty)` | SCN 01 to SCN 08 | 0.75h | function returns a full scenario with deterministic content |
|
| 405 |
-
| SCN 10 | E03.1 | Person A | tests | Add seeded generation tests and consistency tests | SCN 09 | 1h | same seed plus template returns same scenario and different seeds vary |
|
| 406 |
-
| SCN 11 | E03.2 | Person B | fixtures | Create hand checked golden scenarios for prompt testing | SCN 09 | 0.75h | three fixed scenarios are available for deterministic manual testing |
|
| 407 |
| SCN 12 | E03.2 | Person D | docs | Write plain language scenario summaries for UI examples and README | SCN 03 to SCN 05 | 0.5h | each template has a clean one paragraph explanation for judges | ⬜ Not started | — |
|
| 408 |
| SCN 13 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time slot conflicts and duration | SCN 07 | 1h | constraint generator can produce realistic booking conflicts across domains and the Lab Manager can check availability | ⬜ Not started | — |
|
| 409 |
|
|
@@ -426,17 +451,17 @@ As the Lab Manager, I want grounded negotiation plus deterministic feasibility c
|
|
| 426 |
|
| 427 |
| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
|
| 428 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
| 429 |
-
| AGT 01 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Draft domain-neutral system prompt for Scientist role from normalized scenario data | MOD 01, SCN 11 | 0.75h | prompt clearly explains role, mapped constraints, and JSON output contract |
|
| 430 |
-
| AGT 02 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build observation to prompt formatting helper from normalized scenario-derived observations | AGT 01, MOD 03 | 0.75h | formatted prompt includes task info, history, and action schema consistently |
|
| 431 |
| AGT 03 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 | 0.75h | malformed output triggers at least one controlled retry or explicit failure | ⬜ Not started | — |
|
| 432 |
-
| AGT 04 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build baseline heuristic Scientist for non trained smoke tests | AGT 02 | 1h | baseline can complete episodes without crashing |
|
| 433 |
-
| AGT 05 | E04.2 | Person A and B | `replicalab/agents/lab_manager_policy.py` | Implement deterministic feasibility checker against normalized constraints, resources, schedule, and policy rules | SCN 07, MOD 05 | 1.25h | checker returns clear pass or fail per constraint dimension |
|
| 434 |
-
| AGT 06 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | AGT 05, SCN 08 | 1h | lab manager can suggest at least one sensible revision when initial plan fails |
|
| 435 |
-
| AGT 07 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Add model-backed response synthesis from feasibility results and suggested revisions | AGT 05 | 0.75h | output is readable, grounded in checker results, and maps cleanly to underlying checks |
|
| 436 |
| AGT 08 | E04.1 | Person B | tests | Add prompt formatting and parse tests for Scientist policy | AGT 01 to AGT 04 | 0.75h | tests cover happy path and malformed output handling | ⬜ Not started | — |
|
| 437 |
| AGT 09 | E04.2 | Person A | tests | Add deterministic feasibility checker tests for Lab Manager grounding | AGT 05 to AGT 07 | 0.75h | same proposal plus same normalized scenario returns the same checker results every time | ⬜ Not started | — |
|
| 438 |
| AGT 10 | E04.1 | Person B | `replicalab/prompts/` | Write prompt text files for all three roles: `scientist.txt`, `lab_manager.txt`, `judge.txt` | AGT 01, AGT 07, JDG 06 | 0.75h | prompt files exist, are loadable, and assemble correctly from normalized scenario data and agreed role behavior | ⬜ Not started | — |
|
| 439 |
-
| AGT 11 | E04.1 | Person B | docs | Select and document base model for Scientist training with rationale for model size, license, and structured output capability | AGT 01 | 0.5h | decision is recorded and all team members know which model will be fine tuned |
|
| 440 |
|
| 441 |
---
|
| 442 |
|
|
|
|
| 295 |
- Completed scope for `FND 09`: added `openenv.yaml` with OpenEnv manifest metadata plus the minimal repo wiring required for local OpenEnv validation (`openenv-core` dependency, `server` script entry point, `uv.lock`, and `server.app.main()`)
|
| 296 |
- Completed scope for `FND 10`: created `replicalab/outputs/` with tracked `logs/`, `replays/`, and `plots/` subdirectories
|
| 297 |
- Completed scope for `FND 11`: added `server/requirements.txt` with standalone runtime dependency pins and verified installation from that file
|
| 298 |
+
- Completed scope for `FND 03`: imported the full React plus Vite frontend tree from Kush's branch onto `ayush`, including the app shell, pages, shared components, assets, and TypeScript config, and validated it with `npm --prefix frontend install` plus `npm --prefix frontend run build`
|
| 299 |
+
- Completed scope for `FND 12`: imported `frontend/vite.config.ts` with local `/api` and `/ws` proxy support plus stable Vite build settings and validated the build on `ayush`
|
| 300 |
- Partial backend scope imported from Max's PR: `server/app.py`, `server/Dockerfile`, and `docs/max/deployment.md` were normalized onto the current standards and validated locally against the stub env
|
|
|
|
| 301 |
- Newly unblocked by `FND 08`: `MOD 01`, `MOD 02`, `MOD 03`, `MOD 12`, `SCN 01`
|
| 302 |
- Newly unblocked by `FND 06`: `DOC 01`
|
| 303 |
+
- Newly unblocked by `FND 03`: `FND 13`, `UI 01`
|
| 304 |
+
- Remaining Epic E01 work still gated by follow-on dependencies: `FND 13`
|
| 305 |
- Remaining completion items for the imported backend scaffold: real-env integration, Docker validation, and final deployment verification
|
| 306 |
+
- Completed scope for `SCN 01` to `SCN 10`: added deterministic seed utilities, normalized scenario-pack models, math / ML / finance template builders, difficulty scaling, hidden reference specs, allowed substitutions, and seeded scenario tests
|
| 307 |
+
- Completed scope for `SCN 11`: added three fixed golden scenarios for deterministic prompt and manual checks under `tests/fixtures/golden_scenarios.json`
|
| 308 |
+
- Completed scope for `AGT 01`: added a domain-neutral Scientist system prompt builder that renders role instructions, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON output contract from normalized scenario data
|
| 309 |
+
- Newly unblocked by `SCN 11` and `AGT 01`: `AGT 02`, `AGT 11`, `TRN 04`, `TRN 08`
|
| 310 |
+
- Remaining Epic E03 work after the scenario bundle: `SCN 12`, `SCN 13`
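
As a rough illustration of the deterministic seed utilities noted for `SCN 01`, a minimal sketch follows. The name `seed_rng` comes from the task table, but its signature and return type here are assumptions, and `pick_template` is a hypothetical helper added only to show usage. The template names are taken from the scenario module file names.

```python
import random

# Template names taken from the scenario module file names in the task table.
TEMPLATES = ("math_reasoning", "ml_benchmark", "finance_trading")


def seed_rng(seed: int) -> random.Random:
    """Return an isolated RNG so callers never touch the global random state.

    Assumed signature; the repository's helper may differ.
    """
    return random.Random(seed)


def pick_template(seed: int) -> str:
    """Hypothetical helper: choose a template deterministically from a seed."""
    return seed_rng(seed).choice(TEMPLATES)
```

Returning a dedicated `random.Random` instance rather than calling `random.seed` keeps scenario generation reproducible even when other code in the process draws random numbers.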
|
| 311 |
|
| 312 |
### User stories
|
| 313 |
|
|
|
|
| 323 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
| 324 |
| FND 01 | E01.1 | Person C | repo root | Create repo structure and base folders from agreed layout | none | 0.5h | all top level folders exist and repo clones cleanly | ✅ Completed | Person B (Ayush) |
|
| 325 |
| FND 02 | E01.1 | Person C | `pyproject.toml` | Add Python project config and dependencies placeholder | FND 01 | 0.5h | project installs locally without missing package errors for base modules | ✅ Completed | Person B (Ayush) |
|
| 326 |
+
| FND 03 | E01.1 | Person C | `frontend/package.json` | Initialize React plus Vite frontend shell | FND 01 | 0.5h | `npm install` and dev server run successfully | ✅ Completed | Kush |
|
| 327 |
| FND 04 | E01.2 | Person A | `replicalab/models.py` | Add empty Pydantic models and shared type names | FND 01 | 0.5h | import paths resolve for all placeholder models | ✅ Completed | Person B (Ayush) |
|
| 328 |
| FND 05 | E01.2 | Person C | `.gitignore` and `.dockerignore` | Add ignore rules for Python, Node, logs, notebooks, and build artifacts. `.dockerignore` must explicitly exclude `.git`, `node_modules`, `notebooks/`, `tests/`, `__pycache__`, `.venv`, and output files to keep the Docker image lean | FND 01 | 0.25h | repo status stays clean after local run and build, and Docker build excludes non-runtime files | ✅ Completed | Person B (Ayush) |
|
| 329 |
| FND 06 | E01.2 | Person D | `README.md` | Add temporary project stub with title, mission, team roles, and local setup placeholder | FND 01 | 0.5h | new contributor can understand repo purpose in under two minutes | ✅ Completed | Person B (Ayush) |
|
|
|
|
| 332 |
| FND 09 | E01.2 | Person A | `openenv.yaml` | Create OpenEnv configuration file specifying environment class, action and observation types, and server settings | FND 04 | 0.5h | OpenEnv can discover and serve the environment using this config file | ✅ Completed | Person B (Ayush) |
|
| 333 |
| FND 10 | E01.1 | Person C | `replicalab/outputs/` | Create output directory structure with `logs/`, `replays/`, and `plots/` subdirectories and add to gitignore | FND 01 | 0.25h | output directories exist and generated files are not committed to git | ✅ Completed | Person B (Ayush) |
|
| 334 |
| FND 11 | E01.1 | Person C | `server/requirements.txt` | Create server requirements file pinning FastAPI, uvicorn, websockets, and other runtime dependencies | FND 02 | 0.25h | server can be installed from requirements.txt independently of pyproject.toml | ✅ Completed | Max (Person C) |
|
| 335 |
+
| FND 12 | E01.1 | Person C | `frontend/vite.config.ts` | Create Vite config with API and WebSocket proxy support for local development plus stable build output settings | FND 03 | 0.5h | frontend dev server can reach backend without manual URL edits and build output is predictable for Docker packaging | ✅ Completed | Kush |
|
| 336 |
| FND 13 | E01.1 | Person D | `frontend/tailwind.config.ts` and `frontend/postcss.config.js` | Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles | FND 03 | 0.75h | frontend can use Tailwind utilities and shared shadcn compatible theme tokens without CSS pipeline errors | ⬜ Not started | — |
|
| 337 |
|
| 338 |
---
|
|
|
|
| 350 |
- `MOD 02` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
|
| 351 |
- `MOD 03` status: completed on 2026-03-08
|
| 352 |
- `MOD 03` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
|
| 353 |
+
- `MOD 04` status: completed on 2026-03-08
|
| 354 |
+
- `MOD 04` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
|
| 355 |
+
- `MOD 05` status: completed on 2026-03-08
|
| 356 |
+
- `MOD 05` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
|
| 357 |
+
- `MOD 11` status: completed on 2026-03-08
|
| 358 |
+
- `MOD 11` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
|
| 359 |
+
- `MOD 12` status: completed on 2026-03-08
|
| 360 |
+
- `MOD 12` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
|
| 361 |
+
- `MOD 09` status: completed on 2026-03-08
|
| 362 |
- Completed scope for `MOD 01`: replaced the placeholder `ScientistAction` with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, rejected mixed-mode payloads, added conditional validation for proposal, revision, request-info, and accept modes, added focused schema tests, and patched the stub server so `accept` no longer overwrites the current protocol with default values
|
| 363 |
- Completed scope for `MOD 02`: replaced the placeholder `LabManagerAction` with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, enforced feasible-flag consistency across budget, equipment, reagent, schedule, and staff checks, rejected suggestion fields outside `suggest_alternative`, and added focused validation tests
|
| 364 |
- Completed scope for `MOD 03`: introduced typed `ConversationEntry` and `Protocol` models, upgraded both observation branches to use typed nested structures with non-negative numeric constraints and stable keys, and verified dict-to-model coercion through the current stub server and focused tests
|
| 365 |
+
- Completed scope for `MOD 04`: replaced the remaining loose `dict` state and replay fields with typed `Protocol`, `ConversationEntry`, and `RewardBreakdown` models, updated the stub runtime to construct those nested models explicitly, and added round-trip coverage for serialized state and logs
|
| 366 |
+
- Completed scope for `MOD 05`: added deterministic semantic protocol validation in `replicalab/utils/validation.py` with `ValidationResult` and `validate_protocol(...)` checks for resource vocabulary, allowed substitutions, duration limits, required-element coverage, and obvious impossibilities against the normalized scenario pack
|
| 367 |
+
- Completed scope for `MOD 11`: introduced typed `RewardBreakdown` and `StepInfo` models, upgraded `StepResult.info` to the reserved-key contract while still allowing debug metadata, and updated the stub runtime to build typed reward and step-info payloads explicitly
|
| 368 |
+
- Completed scope for `MOD 12`: added `replicalab/config.py` as the shared constants module for default scenario, difficulty, round cap, budget cap, timeout values, stub reward, and API host or port defaults; updated the server and scenario builders to import those constants instead of repeating magic numbers
|
| 369 |
+
- Completed scope for `MOD 09`: added `replicalab/agents/scientist_policy.py` with a raw-text parser that extracts JSON from plain text or fenced blocks, validates it into `ScientistAction`, and raises an explicit `ScientistOutputParseError` for missing JSON, invalid JSON, or schema failures; added focused parser tests and package exports
|
| 370 |
- Newly unblocked by `MOD 01`: `MOD 05`, `MOD 09`
|
| 371 |
- Newly unblocked by `MOD 03`: `MOD 04`, `MOD 11`
|
| 372 |
+
- Newly unblocked by `MOD 04`: `MOD 07`, `ENV 01`
|
| 373 |
+
- Newly unblocked by `MOD 05`: `MOD 06`, `AGT 05`
|
| 374 |
+
- `MOD 11` does not introduce a new formal dependency edge by itself, but it stabilizes `StepResult` metadata for environment, API, replay, and training consumers
|
| 375 |
+
- `MOD 09` does not fully unblock a new task by itself, but it removes one half of the blocker on `AGT 03`; `AGT 03` now only waits on `AGT 02`
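
The `MOD 09` parser behavior described above (extract JSON from plain text or a fenced block, validate it, and raise an explicit error otherwise) can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: `OutputParseError` stands in for `ScientistOutputParseError`, the `action_type` field and the mode strings in `ALLOWED_ACTION_TYPES` are guesses, and a plain dictionary check replaces validation into the real `ScientistAction` schema.

```python
import json
import re

# Built at runtime so this example can itself sit inside a fenced block.
FENCE = "`" * 3
_FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL)

# Assumed mode names; the real enum values live in replicalab/models.py.
ALLOWED_ACTION_TYPES = {"propose", "revise", "request_info", "accept"}


class OutputParseError(ValueError):
    """Stand-in for the repository's ScientistOutputParseError."""


def extract_json(text: str) -> dict:
    """Pull one JSON object out of raw model text, preferring fenced blocks."""
    match = _FENCED_JSON.search(text)
    if match:
        candidate = match.group(1)
    else:
        # Fall back to the outermost brace pair in the plain text.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end < start:
            raise OutputParseError("no JSON object found in model output")
        candidate = text[start : end + 1]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError as exc:
        raise OutputParseError(f"invalid JSON: {exc.msg}") from exc


def parse_scientist_output(text: str) -> dict:
    """Extract the payload and apply a minimal stand-in schema check."""
    payload = extract_json(text)
    action_type = payload.get("action_type")
    if not isinstance(action_type, str) or action_type not in ALLOWED_ACTION_TYPES:
        raise OutputParseError(f"unknown or missing action_type: {action_type!r}")
    return payload
```

Raising one dedicated exception type for all three failure modes gives a retry strategy such as `AGT 03` a single error to catch.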
|
| 376 |
|
| 377 |
### User stories
|
| 378 |
|
|
|
|
| 389 |
| MOD 01 | E02.1 | Person A | `replicalab/models.py` | Implement `ScientistAction` schema | FND 08 | 0.5h | valid scientist actions parse and invalid fields raise validation errors | ✅ Completed | Person B (Ayush) |
|
| 390 |
| MOD 02 | E02.1 | Person A | `replicalab/models.py` | Implement `LabManagerAction` schema | FND 08 | 0.5h | valid lab manager actions parse and invalid fields raise validation errors | ✅ Completed | Person B (Ayush) |
|
| 391 |
| MOD 03 | E02.1 | Person A | `replicalab/models.py` | Implement role specific `Observation` models | FND 08 | 0.75h | scientist and lab observations serialize to JSON with stable keys | ✅ Completed | Person B (Ayush) |
|
| 392 |
+
| MOD 04 | E02.2 | Person A | `replicalab/models.py` | Implement `EpisodeState` and `EpisodeLog` models | MOD 03 | 0.75h | full state round trip serialize plus deserialize works | ✅ Completed | Person B (Ayush) |
|
| 393 |
+
| MOD 05 | E02.1 | Person A | `replicalab/utils/validation.py` | Add protocol validation for sample size, controls, duration, equipment vocab, reagent vocab | MOD 01 | 1h | invalid protocol examples are rejected with readable reasons | ✅ Completed | Person B (Ayush) |
|
| 394 |
| MOD 06 | E02.1 | Person A | `replicalab/utils/validation.py` | Add semantic validators for impossible plans such as zero sample size with positive controls | MOD 05 | 0.75h | semantic validator catches at least five invalid edge cases | ⬜ Not started | — |
|
| 395 |
| MOD 07 | E02.2 | Person C | `replicalab/utils/logging.py` | Add state serialization helper for replay logs | MOD 04 | 0.5h | state logs can be written and loaded without loss | ⬜ Not started | — |
|
| 396 |
| MOD 08 | E02.2 | Person A | tests | Write unit tests for schemas and validators | MOD 01 to MOD 07 | 1h | tests cover valid parse, invalid parse, and replay serialization | ⬜ Not started | — |
|
| 397 |
+
| MOD 09 | E02.2 | Person B | `replicalab/agents/scientist_policy.py` | Add output parser that maps model text to `ScientistAction` | MOD 01 | 0.75h | parser returns structured action or explicit parse error | ✅ Completed | — |
|
| 398 |
| MOD 10 | E02.2 | Person C | API docs | Publish schema examples for frontend and notebook clients | MOD 01 to MOD 04 | 0.5h | frontend and notebook can mock against shared sample payloads | ⬜ Not started | — |
|
| 399 |
+
| MOD 11 | E02.1 | Person A | `replicalab/models.py` | Implement `StepResult` model with observation, reward, done flag, and info dict | MOD 03 | 0.5h | step result serializes cleanly and all consumers agree on its shape | ✅ Completed | Person B (Ayush) |
|
| 400 |
+
| MOD 12 | E02.2 | Person A | `replicalab/config.py` | Create environment configuration module with constants for max rounds, default difficulty, timeout duration, max budget, and round time limit | FND 08 | 0.5h | all modules import config from one place and no magic numbers remain in env or scoring code | ✅ Completed | Person B (Ayush) |
|
| 401 |
|
| 402 |
---
|
| 403 |
|
|
|
|
| 418 |
|
| 419 |
| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
|
| 420 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
| 421 |
+
| SCN 01 | E03.1 | Person A | `replicalab/utils/seed.py` | Implement deterministic RNG helper `seed_rng()` in dedicated seed utility module | FND 08 | 0.5h | same seed always yields the same random choices and seed module is importable from scenarios and env | ✅ Completed | Person B (Ayush) |
|
| 422 |
+
| SCN 02 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Define normalized scenario schema with task summary, success criteria, constraints, resources, allowed substitutions, and hidden reference spec | MOD 04 | 0.75h | all scenario builders return the same normalized top level structure and mapper-ready inputs | ✅ Completed | Person B (Ayush) |
|
| 423 |
+
| SCN 03 | E03.2 | Person A | `replicalab/scenarios/math_reasoning.py` | Implement mathematics template with theorem, proof-goal, tool, time, and review constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | ✅ Completed | Person B (Ayush) |
|
| 424 |
+
| SCN 04 | E03.2 | Person A | `replicalab/scenarios/ml_benchmark.py` | Implement ML benchmark template with dataset, compute, time, and evaluation constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | ✅ Completed | Person B (Ayush) |
|
| 425 |
+
| SCN 05 | E03.2 | Person A | `replicalab/scenarios/finance_trading.py` | Implement finance and trading planning template with risk, capital, slippage, and backtest constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | ✅ Completed | Person B (Ayush) |
|
| 426 |
+
| SCN 06 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement difficulty application for easy, medium, hard by mechanically altering constraints, resources, and conflicts | SCN 03 to SCN 05 | 1h | difficulty visibly changes the normalized scenario pack in a meaningful way | ✅ Completed | Person B (Ayush) |
|
| 427 |
+
| SCN 07 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement normalized constraint and resource generator for budget, time, compute, personnel, stock, and bookings | SCN 02 | 1.25h | no generated scenario contains contradictory constraints or resources | ✅ Completed | Person B (Ayush) |
|
| 428 |
+
| SCN 08 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement hidden reference spec and allowed substitutions per template | SCN 03 to SCN 05 | 1h | hidden reference clearly marks what is fixed versus flexible for deterministic scoring | ✅ Completed | Person B (Ayush) |
|
| 429 |
+
| SCN 09 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement `generate_scenario(seed, template, difficulty)` | SCN 01 to SCN 08 | 0.75h | function returns a full scenario with deterministic content | ✅ Completed | Person B (Ayush) |
|
| 430 |
+
| SCN 10 | E03.1 | Person A | tests | Add seeded generation tests and consistency tests | SCN 09 | 1h | same seed plus template returns same scenario and different seeds vary | ✅ Completed | Person B (Ayush) |
|
| 431 |
+
| SCN 11 | E03.2 | Person B | fixtures | Create hand-checked golden scenarios for prompt testing | SCN 09 | 0.75h | three fixed scenarios are available for deterministic manual testing | ✅ Completed | — |
|
| 432 |
| SCN 12 | E03.2 | Person D | docs | Write plain-language scenario summaries for UI examples and README | SCN 03 to SCN 05 | 0.5h | each template has a clean one-paragraph explanation for judges | ⬜ Not started | — |
|
| 433 |
| SCN 13 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time slot conflicts and duration | SCN 07 | 1h | constraint generator can produce realistic booking conflicts across domains and the Lab Manager can check availability | ⬜ Not started | — |
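The determinism criterion for `SCN 01` and `SCN 10` (same seed, same random choices) can be illustrated with a minimal sketch. Only the name `seed_rng` comes from the table above; its signature, the `pick_constraints` helper, and the sample pool are assumptions made for illustration and are not taken from the repository.

```python
import random


def seed_rng(seed: int) -> random.Random:
    """Return an isolated RNG so callers never depend on global random state."""
    return random.Random(seed)


def pick_constraints(seed: int, pool: list[str], k: int = 2) -> list[str]:
    """Hypothetical helper: draw k constraint names deterministically."""
    rng = seed_rng(seed)
    return rng.sample(pool, k)


# Illustrative pool, loosely based on the dimensions named in SCN 07.
POOL = ["budget", "time", "compute", "personnel", "stock", "bookings"]

first = pick_constraints(7, POOL)
second = pick_constraints(7, POOL)  # same seed, so the same draw
```

Returning a dedicated `random.Random` instance, rather than calling `random.seed()`, keeps scenario generation reproducible even when other modules consume random numbers in between.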
|
| 434 |
|
|
|
|
| 451 |
|
| 452 |
| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
|
| 453 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
| 454 |
+
| AGT 01 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Draft domain-neutral system prompt for Scientist role from normalized scenario data | MOD 01, SCN 11 | 0.75h | prompt clearly explains role, mapped constraints, and JSON output contract | ✅ Completed | — |
|
| 455 |
+
| AGT 02 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build observation to prompt formatting helper from normalized scenario-derived observations | AGT 01, MOD 03 | 0.75h | formatted prompt includes task info, history, and action schema consistently | ✅ Completed | — |
|
| 456 |
| AGT 03 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 | 0.75h | malformed output triggers at least one controlled retry or explicit failure | ⬜ Not started | — |
|
| 457 |
+
| AGT 04 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build baseline heuristic Scientist for non-trained smoke tests | AGT 02 | 1h | baseline can complete episodes without crashing | ✅ Completed | — |
|
| 458 |
+
| AGT 05 | E04.2 | Person A and B | `replicalab/agents/lab_manager_policy.py` | Implement deterministic feasibility checker against normalized constraints, resources, schedule, and policy rules | SCN 07, MOD 05 | 1.25h | checker returns clear pass or fail per constraint dimension | ✅ Completed | Person B (Ayush) |
|
| 459 |
+
| AGT 06 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | AGT 05, SCN 08 | 1h | lab manager can suggest at least one sensible revision when initial plan fails | ✅ Completed | — |
|
| 460 |
+
| AGT 07 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Add model-backed response synthesis from feasibility results and suggested revisions | AGT 05 | 0.75h | output is readable, grounded in checker results, and maps cleanly to underlying checks | ✅ Completed | — |
|
| 461 |
| AGT 08 | E04.1 | Person B | tests | Add prompt formatting and parse tests for Scientist policy | AGT 01 to AGT 04 | 0.75h | tests cover happy path and malformed output handling | ⬜ Not started | — |
|
| 462 |
| AGT 09 | E04.2 | Person A | tests | Add deterministic feasibility checker tests for Lab Manager grounding | AGT 05 to AGT 07 | 0.75h | same proposal plus same normalized scenario returns the same checker results every time | ⬜ Not started | — |
|
| 463 |
| AGT 10 | E04.1 | Person B | `replicalab/prompts/` | Write prompt text files for all three roles: `scientist.txt`, `lab_manager.txt`, `judge.txt` | AGT 01, AGT 07, JDG 06 | 0.75h | prompt files exist, are loadable, and assemble correctly from normalized scenario data and agreed role behavior | ⬜ Not started | — |
|
| 464 |
+
| AGT 11 | E04.1 | Person B | docs | Select and document base model for Scientist training with rationale for model size, license, and structured output capability | AGT 01 | 0.5h | decision is recorded and all team members know which model will be fine-tuned | ✅ Completed | — |
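The `AGT 05` and `AGT 09` acceptance criteria (a clear pass or fail per constraint dimension, identical on every run) can be sketched as follows. The commit message names a `check_feasibility` function, but the signature, the `DimensionCheck` type, and the flat numeric limits used here are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionCheck:
    """Hypothetical result record: one entry per constraint dimension."""
    dimension: str
    passed: bool
    reason: str


def check_feasibility(plan: dict[str, float],
                      limits: dict[str, float]) -> list[DimensionCheck]:
    """Compare requested amounts against limits, one dimension at a time."""
    results = []
    # Sorted iteration keeps the output order stable across runs.
    for dimension in sorted(limits):
        requested = plan.get(dimension, 0.0)
        limit = limits[dimension]
        passed = requested <= limit
        reason = f"{dimension}: requested {requested} vs limit {limit}"
        results.append(DimensionCheck(dimension, passed, reason))
    return results


plan = {"budget": 1200.0, "time": 3.0}
limits = {"budget": 1000.0, "time": 5.0, "compute": 8.0}
results = check_feasibility(plan, limits)
```

Because the function has no randomness and no model call, running it twice on the same plan and limits gives equal results, which is the property `AGT 09` is meant to test.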
|
| 465 |
|
| 466 |
---
|
| 467 |
|
docs/agt11_scientist_model_selection.md
ADDED
|
@@ -0,0 +1,45 @@
|
| 1 |
+
# AGT 11 Scientist Model Selection
|
| 2 |
+
|
| 3 |
+
## Decision
|
| 4 |
+
|
| 5 |
+
The primary Scientist training model is **Qwen3-4B**.
|
| 6 |
+
|
| 7 |
+
The stretch fallback is **Qwen3-8B** if H100-only training is acceptable and
|
| 8 |
+
the 4B model underperforms on structured planning quality.
|
| 9 |
+
|
| 10 |
+
## Why Qwen3-4B
|
| 11 |
+
|
| 12 |
+
- Strong enough for structured JSON action output without moving to a much
|
| 13 |
+
slower large-model loop.
|
| 14 |
+
- Small enough for fast RL iteration on H100 and practical 4-bit Colab use.
|
| 15 |
+
- Open weights with a permissive Apache 2.0 license.
|
| 16 |
+
- A clean fit for the current architecture: train the Scientist first while the
|
| 17 |
+
reward and Lab Manager grounding remain deterministic.
|
| 18 |
+
|
| 19 |
+
## Why Not Smaller
|
| 20 |
+
|
| 21 |
+
- Smaller checkpoints are cheaper, but they are more likely to underperform on
|
| 22 |
+
multi-step technical planning and strict output schemas.
|
| 23 |
+
- The project needs enough reasoning quality to negotiate across mathematics,
|
| 24 |
+
machine learning, and finance-trading scenarios.
|
| 25 |
+
|
| 26 |
+
## Why Not Larger By Default
|
| 27 |
+
|
| 28 |
+
- Larger checkpoints slow rollout collection and raise memory pressure.
|
| 29 |
+
- The judged artifact still needs a credible Colab path, not only an H100-only
|
| 30 |
+
path.
|
| 31 |
+
- Faster iteration matters more than squeezing out a marginal quality gain at
|
| 32 |
+
this stage.
|
| 33 |
+
|
| 34 |
+
## Project Usage
|
| 35 |
+
|
| 36 |
+
- **Scientist MVP training:** `Qwen3-4B`
|
| 37 |
+
- **Stretch Scientist training:** `Qwen3-8B`
|
| 38 |
+
- **Lab Manager future path:** reuse the same base family with a separate
|
| 39 |
+
role-specific adapter if the team later trains both roles
|
| 40 |
+
|
| 41 |
+
## Notes
|
| 42 |
+
|
| 43 |
+
- The reward loop stays deterministic regardless of the model choice.
|
| 44 |
+
- `TRN 14` should mirror this decision on the notebook side once the Colab
|
| 45 |
+
skeleton exists.
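As a rough sanity check on the "practical 4-bit Colab use" point, the weight footprint can be estimated with simple arithmetic. This sketch assumes nominal parameter counts of 4e9 and 8e9 taken from the model names (the real counts differ somewhat) and covers weights only: quantization overhead, activations, KV cache, adapters, and optimizer state are excluded.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / (1024 ** 3)


# Nominal sizes inferred from the model names; these are assumptions.
four_b_4bit = weight_memory_gib(4e9, 4)    # roughly 1.9 GiB
eight_b_4bit = weight_memory_gib(8e9, 4)   # roughly 3.7 GiB
four_b_bf16 = weight_memory_gib(4e9, 16)   # roughly 7.5 GiB
```

These figures are a lower bound on real usage, so they indicate relative cost between the two candidates and do not guarantee that a given Colab runtime will fit a training run.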
|
docs/ayush/task_breakdown.md
CHANGED
|
@@ -9,185 +9,187 @@ No assumptions from other documents are used to reclassify blocked status.
|
|
| 9 |
|
| 10 |
## 1. Blocking Status
|
| 11 |
|
| 12 |
-
|
| 13 |
-
`
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
|
|
|
|
|
|
| 21 |
|
| 22 |
---
|
| 23 |
|
| 24 |
-
## 2.
|
| 25 |
|
| 26 |
-
|
| 27 |
-
|
| 28 |
|
| 29 |
-
|
| 30 |
-
|----|------|-----------|---------------------|-----|
|
| 31 |
-
| AGT 01 | Draft domain-neutral Scientist system prompt | MOD 01, SCN 11 | ScientistAction schema + generate_scenario | 0.75h |
|
| 32 |
-
| AGT 05 | Implement deterministic feasibility checker (shared A+B) | SCN 07, MOD 05 | Constraint generator + validation | 1.25h |
|
| 33 |
-
| SCN 11 | Create golden scenarios for prompt testing | SCN 09 | generate_scenario() | 0.75h |
|
| 34 |
-
| JDG 10 | Expose component metrics for training plots | JDG 05, JDG 07 | Reward breakdown (A) + logging (C) | 0.5h |
|
| 35 |
|
| 36 |
-
|
| 37 |
|
| 38 |
-
|
|
|
|
|
|
|
| 39 |
|
| 40 |
-
|
| 41 |
-
2. **SCN 07 + MOD 05** (normalized constraints/resources + validation) -- unblocks AGT 05, AGT 06, AGT 07
|
| 42 |
-
3. **JDG 05 + JDG 06** (reward breakdown + explanation) -- unblocks AGT 10 and is only part of the path for JDG 10
|
| 43 |
-
4. **SCN 08** (minimum viable replication spec) -- unblocks AGT 06 after AGT 05
|
| 44 |
|
| 45 |
---
|
| 46 |
|
| 47 |
-
##
|
| 48 |
|
| 49 |
-
|
| 50 |
-
|
| 51 |
|
| 52 |
| ID | Task | Depends On | Blocked By | Est |
|
| 53 |
|----|------|-----------|-----------|-----|
|
| 54 |
-
| AGT
|
| 55 |
-
| AGT 03 | Parse plus retry for malformed output | MOD 09 (B) + AGT 02 (B) | Person B: MOD 09, AGT 02 | 0.75h |
|
| 56 |
-
| AGT 04 | Baseline heuristic Scientist | AGT 02 (B) | Person B: AGT 02 | 1h |
|
| 57 |
-
| AGT 06 | Alternative suggestion logic from allowed substitutions | AGT 05 (A+B), SCN 08 (A) | Person A: SCN 08, Person A+B: AGT 05 | 1h |
|
| 58 |
-
| AGT 07 | Model-backed Lab Manager response synthesis | AGT 05 (A+B) | Person A+B: AGT 05 | 0.75h |
|
| 59 |
-
| AGT 08 | Prompt formatting and parse tests | AGT 01 to AGT 04 (B) | Person B: AGT 01-04 | 0.75h |
|
| 60 |
-
| AGT 10 | Write domain-neutral prompt text files for all 3 roles | AGT 01 (B) + AGT 07 (B) + JDG 06 (A) | Person A: JDG 06, Person B: AGT 01, AGT 07 | 0.75h |
|
| 61 |
-
| AGT 11 | Select and document base model | AGT 01 (B) | Person B: AGT 01 | 0.5h |
|
| 62 |
|
| 63 |
-
**Total:
|
| 64 |
|
| 65 |
---
|
| 66 |
|
| 67 |
-
##
|
| 68 |
|
| 69 |
-
Cannot proceed until
|
| 70 |
|
| 71 |
-
| ID | Task | Depends On |
|
| 72 |
-
|----|------|-----------|----------------
|
| 73 |
| TRN 01 | Notebook skeleton | API 10 | Deployed HF Space | 0.5h |
|
| 74 |
-
| TRN 03 | Env client wrapper in notebook | API 06 | WebSocket handler | 1h |
|
| 75 |
-
| TRN 13 | client.py reusable module | API 06 | WebSocket handler | 1h |
|
| 76 |
|
| 77 |
**Total: 3 tasks, 2.5h**
|
| 78 |
|
| 79 |
-
### What to ask Max for first
|
| 80 |
|
| 81 |
-
1.
|
| 82 |
-
2.
|
| 83 |
|
| 84 |
---
|
| 85 |
|
| 86 |
-
##
|
| 87 |
|
| 88 |
-
These execute in strict order once Person A, Person C, and earlier
|
| 89 |
are done.
|
| 90 |
|
| 91 |
| Order | ID | Task | Depends On | Est |
|
| 92 |
|-------|----|------|-----------|-----|
|
| 93 |
-
| 1 | TRN 02 | Package install and model setup cell | TRN 01
|
| 94 |
-
| 2 | TRN 14 | Select and document base model (notebook side) | TRN 01
|
| 95 |
-
| 3 | TRN 04 | Rollout collection loop | TRN 03
|
| 96 |
-
| 4 | TRN 05 | Connect rollouts to GRPO trainer | TRN 04
|
| 97 |
-
| 5 | TRN 06 | Log episode metrics | JDG 10
|
| 98 |
-
| 6 | TRN 07 | Plot reward curves | TRN 06
|
| 99 |
-
| 7 | TRN 08 | Before vs after eval on fixed seeds | SCN 11
|
| 100 |
-
| 8 | TRN 09 | Policy loading for trained checkpoint | TRN 05
|
| 101 |
-
| 9 | TRN 10 | Export plots to outputs/plots | TRN 07
|
| 102 |
-
| 10 | TRN 15 | Agreement and invalid action rate aggregation | TRN 06
|
| 103 |
-
| 11 | OBS 06 | Log training run metadata | TRN 06
|
| 104 |
|
| 105 |
**Total: 11 tasks, 7.5h**
|
| 106 |
|
| 107 |
---
|
| 108 |
|
| 109 |
-
##
|
| 110 |
|
| 111 |
-
| ID | Task | Depends On |
|
| 112 |
-
|----|------|-----------|-----------------
|
| 113 |
-
| TST 09 | Notebook smoke test for fresh runtime | TRN 12 | Evaluation storytelling
|
| 114 |
|
| 115 |
**Total: 1 task, 0.5h**
|
| 116 |
|
| 117 |
---
|
| 118 |
|
| 119 |
-
##
|
| 120 |
-
|
| 121 |
-
All phases are gated by the listed external dependency being delivered first.
|
| 122 |
|
| 123 |
-
### Phase 1:
|
| 124 |
|
| 125 |
-
1.
|
| 126 |
-
2.
|
| 127 |
-
3.
|
|
|
|
|
|
|
| 128 |
|
| 129 |
-
### Phase 2:
|
| 130 |
|
| 131 |
-
|
| 132 |
-
5. **AGT 01** -- Draft domain-neutral Scientist system prompt
|
| 133 |
-
6. **AGT 11** -- Select and document base model
|
| 134 |
|
| 135 |
-
### Phase 3: After AGT
|
| 136 |
|
| 137 |
-
7.
|
| 138 |
-
8. **AGT 03** -- Add parse plus retry logic
|
| 139 |
-
9. **AGT 04** -- Build baseline heuristic Scientist
|
| 140 |
-
10. **AGT 08** -- Write tests for prompt formatting and parsing
|
| 141 |
|
| 142 |
-
### Phase 4: After
|
| 143 |
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
13. **AGT 07** -- Model-backed Lab Manager response synthesis
|
| 147 |
-
14. **AGT 10** -- Write all domain-neutral prompt text files
|
| 148 |
-
15. **JDG 10** -- Expose component metrics for training plots
|
| 149 |
|
| 150 |
-
### Phase 5: After Max
|
| 151 |
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
|
| 158 |
-
### Phase 6: Training
|
| 159 |
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
|
| 170 |
-
### Phase 7:
|
| 171 |
|
| 172 |
-
|
| 173 |
|
| 174 |
---
|
| 175 |
|
| 176 |
-
##
|
| 177 |
|
| 178 |
| Category | Count | Hours |
|
| 179 |
|----------|-------|-------|
|
| 180 |
| Active now | 1 | 0.75h |
|
| 181 |
-
|
|
| 182 |
-
| Blocked by
|
| 183 |
-
|
|
| 184 |
-
|
|
| 185 |
-
|
|
| 186 |
-
|
|
|
|
|
| 187 |
|
| 188 |
---
|
| 189 |
|
| 190 |
-
##
|
| 191 |
|
| 192 |
### Trainable Scientist policy
|
| 193 |
|
|
@@ -212,7 +214,8 @@ of the training reward loop.
|
|
| 212 |
|
| 213 |
### Hybrid Lab Manager
|
| 214 |
|
| 215 |
-
The MVP Lab Manager path is
|
|
|
|
| 216 |
- A deterministic feasibility checker remains the source of truth for
|
| 217 |
`feasible`, constraint flags, and any final structured `LabManagerAction`.
|
| 218 |
- Model-backed response generation is used for negotiation language and
|
|
@@ -220,14 +223,15 @@ The MVP Lab Manager path is now hybrid:
|
|
| 220 |
- The reward formula does not change. The deterministic rubric scores the final
|
| 221 |
plan against the hidden reference spec regardless of how the Lab Manager
|
| 222 |
generates its language.
|
| 223 |
-
- Reward does not split into separate Scientist
|
| 224 |
Both roles share the same cooperative reward signal.
|
| 225 |
- If the team later shares one base model across both roles, the pragmatic
|
| 226 |
-
default is one base model (Qwen3-4B) with separate role-specific adapters.
|
| 227 |
|
| 228 |
### Prompt assembly
|
| 229 |
|
| 230 |
Ayush-owned prompts should be assembled from normalized scenario data:
|
|
|
|
| 231 |
- `task_summary`
|
| 232 |
- `success_criteria`
|
| 233 |
- `constraints`
|
|
@@ -240,20 +244,7 @@ physics, or biology.
|
|
| 240 |
|
| 241 |
---
|
| 242 |
|
| 243 |
-
##
|
| 244 |
-
|
| 245 |
-
| Risk | Impact | Mitigation |
|
| 246 |
-
|------|--------|------------|
|
| 247 |
-
| Person A SCN 09 or MOD 05 delayed | Blocks AGT 01 via SCN 11 and delays AGT 05-07 plus downstream work | Communicate priority order to Person A early |
|
| 248 |
-
| Person C API delayed | Blocks entire training pipeline (TRN 01-15) | Coordinate with Person C on API 06 timeline |
|
| 249 |
-
| Qwen3-4B underperforms on structured output | Scientist produces low quality protocols | Fall back to Qwen3-8B on H100, use reduced-scale Colab fallback |
|
| 250 |
-
| RL training produces flat rewards | No improvement to demo | Have baseline heuristic ready, tune reward weights with Person A |
|
| 251 |
-
| Scientist produces invalid JSON | Rollout loop crashes | AGT 03 parse plus retry is critical, build it robust |
|
| 252 |
-
| Hybrid Lab Manager increases variance if generation settings are too loose | Slower RL convergence | Keep checker as source of truth, use low-variance generation or frozen manager weights during Scientist training |
|
| 253 |
-
|
| 254 |
-
---
|
| 255 |
-
|
| 256 |
-
## 11. Files Person B Owns
|
| 257 |
|
| 258 |
| File | Purpose |
|
| 259 |
|------|---------|
|
|
|
|
| 9 |
|
| 10 |
## 1. Blocking Status
|
| 11 |
|
| 12 |
+
`FND 08`, `FND 09`, `MOD 09`, `SCN 11`, and `AGT 01` are now complete.
|
| 13 |
+
The scenario prerequisite bundle (`SCN 01` to `SCN 10`) also exists in the
|
| 14 |
+
repo, so Ayush no longer waits on `SCN 09` to start prompt-layer work.
|
| 15 |
+
|
| 16 |
+
Ayush now has one fully unblocked task:
|
| 17 |
+
|
| 18 |
+
1. `AGT 03` -- highest leverage next task inside the Scientist chain
|
| 19 |
+
|
| 20 |
+
The prompt and Lab Manager workstream continues to assume a normalized scenario
|
| 21 |
+
pack below the stable outer contract, so Ayush-owned prompting should be
|
| 22 |
+
assembled from mapped scenario data rather than hard-coded to one domain.
|
| 23 |
|
| 24 |
---
|
| 25 |
|
| 26 |
+
## 2. Active Now
|
| 27 |
|
| 28 |
+
| ID | Task | Depends On | Why It Is Ready | Est |
|
| 29 |
+
|----|------|-----------|-----------------|-----|
|
| 30 |
+
| AGT 03 | Parse plus retry for malformed output | MOD 09, AGT 02 | The parser and observation formatter are now both complete | 0.75h |
|
| 31 |
+
|
| 32 |
+
**Total: 1 task, 0.75h**
|
| 33 |
+
|
| 34 |
+
---
|
| 35 |
|
| 36 |
+
## 3. Internal Ayush Chain After AGT 03
|
| 37 |
|
| 38 |
+
These are blocked only by earlier Ayush-owned work.
|
| 39 |
|
| 40 |
+
| ID | Task | Depends On | Blocked By | Est |
|
| 41 |
+
|----|------|-----------|-----------|-----|
|
| 42 |
+
| AGT 08 | Prompt formatting and parse tests | AGT 01 to AGT 04 | Person B: AGT 03 | 0.75h |
|
| 43 |
|
| 44 |
+
**Total: 1 task, 0.75h**
|
|
|
|
|
|
|
|
|
|
| 45 |
|
| 46 |
---
|
| 47 |
|
| 48 |
+
## 4. Still Blocked by Kian (Person A) or Mixed A+B Work
|
| 49 |
+
|
| 50 |
+
| ID | Task | Depends On | Remaining External Deliverable | Est |
|
| 51 |
+
|----|------|-----------|-------------------------------|-----|
|
| 52 |
+
| JDG 10 | Expose component metrics for training plots | JDG 05, JDG 07 | `JDG 05` from Kian and `JDG 07` from Max | 0.5h |
|
| 53 |
+
|
| 54 |
+
**Total: 1 task, 0.5h**
|
| 55 |
+
|
| 56 |
+
### What to ask Kian for first
|
| 57 |
|
| 58 |
+
1. `JDG 05` and `JDG 06` -- unlock `JDG 10` and later `AGT 10`
|
| 59 |
+
2. `SCN 13` -- deepens the booking-conflict layer for the Lab Manager path
|
| 60 |
+
3. `ENV 01` -- makes the real environment path available beyond the stub server
|
| 61 |
+
|
| 62 |
+
---
|
| 63 |
+
|
| 64 |
+
## 5. Mixed Chain After AGT 05 and Judge Work
|
| 65 |
+
|
| 66 |
+
These depend on both Ayush-owned work and remaining upstream work.
|
| 67 |
|
| 68 |
| ID | Task | Depends On | Blocked By | Est |
|
| 69 |
|----|------|-----------|-----------|-----|
|
| 70 |
+
| AGT 10 | Write domain-neutral prompt text files for all 3 roles | AGT 01, AGT 07, JDG 06 | Person A: JDG 06 | 0.75h |
|
| 71 |
|
| 72 |
+
**Total: 1 task, 0.75h**
|
| 73 |
|
| 74 |
---
|
| 75 |
|
| 76 |
+
## 6. Blocked by Max (Person C)
|
| 77 |
|
| 78 |
+
Cannot proceed until Max delivers the server and deployment pieces.
|
| 79 |
|
| 80 |
+
| ID | Task | Depends On | Max Deliverable | Est |
|
| 81 |
+
|----|------|-----------|----------------|-----|
|
| 82 |
| TRN 01 | Notebook skeleton | API 10 | Deployed HF Space | 0.5h |
|
| 83 |
+
| TRN 03 | Env client wrapper in notebook | API 06 | WebSocket handler against the real env | 1h |
|
| 84 |
+
| TRN 13 | `client.py` reusable module | API 06 | WebSocket handler against the real env | 1h |
|
| 85 |
|
| 86 |
**Total: 3 tasks, 2.5h**
|
| 87 |
|
| 88 |
+
### What to ask Max for first
|
| 89 |
|
| 90 |
+
1. `API 06` -- unblocks `TRN 03` and `TRN 13`
|
| 91 |
+
2. `API 10` -- unblocks `TRN 01`
|
| 92 |
|
| 93 |
---
|
| 94 |
|
| 95 |
+
## 7. Deep Training Chain
|
| 96 |
|
| 97 |
+
These execute in strict order once Person A, Person C, and earlier Ayush tasks
|
| 98 |
are done.
|
| 99 |
|
| 100 |
| Order | ID | Task | Depends On | Est |
|
| 101 |
|-------|----|------|-----------|-----|
|
| 102 |
+
| 1 | TRN 02 | Package install and model setup cell | TRN 01 | 0.75h |
|
| 103 |
+
| 2 | TRN 14 | Select and document base model (notebook side) | TRN 01 | 0.5h |
|
| 104 |
+
| 3 | TRN 04 | Rollout collection loop | TRN 03, AGT 01 | 1h |
|
| 105 |
+
| 4 | TRN 05 | Connect rollouts to GRPO trainer | TRN 04 | 1.25h |
|
| 106 |
+
| 5 | TRN 06 | Log episode metrics | JDG 10, TRN 04 | 0.75h |
|
| 107 |
+
| 6 | TRN 07 | Plot reward curves | TRN 06 | 0.5h |
|
| 108 |
+
| 7 | TRN 08 | Before vs after eval on fixed seeds | SCN 11, TRN 05 | 1h |
|
| 109 |
+
| 8 | TRN 09 | Policy loading for trained checkpoint | TRN 05 | 0.5h |
|
| 110 |
+
| 9 | TRN 10 | Export plots to outputs/plots | TRN 07 | 0.25h |
|
| 111 |
+
| 10 | TRN 15 | Agreement and invalid action rate aggregation | TRN 06, TRN 08, OBS 09 | 0.5h |
|
| 112 |
+
| 11 | OBS 06 | Log training run metadata | TRN 06 | 0.5h |
|
| 113 |
|
| 114 |
**Total: 11 tasks, 7.5h**
|
| 115 |
|
| 116 |
---
|
| 117 |
|
| 118 |
+
## 8. Blocked by Kush (Person D)
|
| 119 |
|
| 120 |
+
| ID | Task | Depends On | Kush Deliverable | Est |
|
| 121 |
+
|----|------|-----------|-----------------|-----|
|
| 122 |
+
| TST 09 | Notebook smoke test for fresh runtime | TRN 12 | Evaluation storytelling and final notebook flow | 0.5h |
|
| 123 |
|
| 124 |
**Total: 1 task, 0.5h**
|
| 125 |
|
| 126 |
---
|
| 127 |
|
| 128 |
+
## 9. Recommended Execution Order
|
|
|
|
|
|
|
| 129 |
|
| 130 |
+
### Phase 1: Completed
|
| 131 |
|
| 132 |
+
1. `FND 08`
|
| 133 |
+
2. `FND 09`
|
| 134 |
+
3. `MOD 09`
|
| 135 |
+
4. `SCN 11`
|
| 136 |
+
5. `AGT 01`
|
| 137 |
|
| 138 |
+
### Phase 2: Active now
|
| 139 |
|
| 140 |
+
6. `AGT 03`
|
|
|
|
|
|
|
| 141 |
|
| 142 |
+
### Phase 3: After AGT 03
|
| 143 |
|
| 144 |
+
7. `AGT 08`
|
|
|
|
|
|
|
|
|
|
| 145 |
|
| 146 |
+
### Phase 4: After judge work
|
| 147 |
|
| 148 |
+
8. `AGT 10`
|
| 149 |
+
9. `JDG 10`
|
|
|
|
|
|
|
|
|
|
| 150 |
|
| 151 |
+
### Phase 5: After Max lands `API 06` and `API 10`
|
| 152 |
|
| 153 |
+
10. `TRN 13`
|
| 154 |
+
11. `TRN 01`
|
| 155 |
+
12. `TRN 02`
|
| 156 |
+
13. `TRN 03`
|
| 157 |
+
14. `TRN 14`
|
| 158 |
|
| 159 |
+
### Phase 6: Training pipeline
|
| 160 |
|
| 161 |
+
15. `TRN 04`
|
| 162 |
+
16. `TRN 05`
|
| 163 |
+
17. `TRN 06`
|
| 164 |
+
18. `TRN 07`
|
| 165 |
+
19. `TRN 08`
|
| 166 |
+
20. `TRN 09`
|
| 167 |
+
21. `TRN 10`
|
| 168 |
+
22. `TRN 15`
|
| 169 |
+
23. `OBS 06`
|
| 170 |
|
| 171 |
+
### Phase 7: Final notebook validation
|
| 172 |
|
| 173 |
+
24. `TST 09`
|
| 174 |
|
| 175 |
---
|
| 176 |
|
| 177 |
+
## 10. Summary Table
|
| 178 |
|
| 179 |
| Category | Count | Hours |
|
| 180 |
|----------|-------|-------|
|
| 181 |
| Active now | 1 | 0.75h |
|
| 182 |
+
| Internal Ayush chain after AGT 03 | 1 | 0.75h |
|
| 183 |
+
| Blocked by Kian or mixed A+B work | 1 | 0.5h |
|
| 184 |
+
| Mixed chain after AGT 05 and judge work | 1 | 0.75h |
|
| 185 |
+
| Blocked by Max | 3 | 2.5h |
|
| 186 |
+
| Deep training chain | 11 | 7.5h |
|
| 187 |
+
| Blocked by Kush | 1 | 0.5h |
|
| 188 |
+
| **Total remaining** | **19** | **13.25h** |
|
| 189 |
|
| 190 |
---
|
| 191 |
|
| 192 |
+
## 11. Base Model Assumptions
|
| 193 |
|
| 194 |
### Trainable Scientist policy
|
| 195 |
|
|
|
|
| 214 |
|
| 215 |
### Hybrid Lab Manager
|
| 216 |
|
| 217 |
+
The MVP Lab Manager path is hybrid:
|
| 218 |
+
|
| 219 |
- A deterministic feasibility checker remains the source of truth for
|
| 220 |
`feasible`, constraint flags, and any final structured `LabManagerAction`.
|
| 221 |
- Model-backed response generation is used for negotiation language and
|
|
|
|
| 223 |
- The reward formula does not change. The deterministic rubric scores the final
|
| 224 |
plan against the hidden reference spec regardless of how the Lab Manager
|
| 225 |
generates its language.
|
| 226 |
+
- Reward does not split into separate Scientist versus Lab Manager objectives.
|
| 227 |
Both roles share the same cooperative reward signal.
|
| 228 |
- If the team later shares one base model across both roles, the pragmatic
|
| 229 |
+
default is one base model (`Qwen3-4B`) with separate role-specific adapters.
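The hybrid split described above can be sketched in a few lines: the structured fields come only from checker output, and the text generator is a replaceable callable that cannot change them. The function name `compose_hybrid_response`, the dictionary shape, and both writers are hypothetical; the repository's own `compose_lab_manager_response` may be structured differently.

```python
from typing import Callable, Dict, List

Checks = Dict[str, bool]  # dimension name -> passed?


def compose_hybrid_response(checks: Checks,
                            write_text: Callable[[List[str]], str]) -> dict:
    """Build a response whose structured fields depend only on the checker."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    return {
        "feasible": not failed,         # decided by the checker alone
        "failed_dimensions": failed,    # decided by the checker alone
        "message": write_text(failed),  # wording only; swappable
    }


def template_writer(failed: List[str]) -> str:
    """Deterministic stand-in for a model-backed text generator."""
    if not failed:
        return "The plan fits every checked constraint."
    return "The plan exceeds limits on: " + ", ".join(failed) + "."


checks = {"budget": False, "time": True, "compute": False}
plain = compose_hybrid_response(checks, template_writer)
terse = compose_hybrid_response(checks, lambda failed: "Please revise.")
```

Swapping the writer changes only `message`, so a looser or tighter generation setting cannot alter what the deterministic rubric sees in the structured fields.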
|
| 230 |
|
| 231 |
### Prompt assembly
|
| 232 |
|
| 233 |
Ayush-owned prompts should be assembled from normalized scenario data:
|
| 234 |
+
|
| 235 |
- `task_summary`
|
| 236 |
- `success_criteria`
|
| 237 |
- `constraints`
|
|
|
|
| 244 |
|
| 245 |
---
|
| 246 |
|
| 247 |
+
## 12. Files Person B Owns
|
| 248 |
|
| 249 |
| File | Purpose |
|
| 250 |
|------|---------|
|
docs/ayush/task_list.md
CHANGED
|
@@ -6,41 +6,47 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 6 |
|
| 7 |
## Current status
|
| 8 |
|
| 9 |
-
- `FND 04` is complete in `replicalab/models.py`
|
| 10 |
- `FND 08` is complete in `docs/fnd08_frozen_json_contract.md`
|
| 11 |
-
- `
|
| 12 |
-
- `
|
| 13 |
-
- `
|
| 14 |
-
-
|
| 15 |
-
-
|
| 16 |
-
-
|
| 17 |
|
| 18 |
---
|
| 19 |
|
| 20 |
## Epic E02. Domain Models
|
| 21 |
|
| 22 |
-
- [
|
| 23 |
|
| 24 |
---
|
| 25 |
|
| 26 |
## Epic E03. Scenario Engine
|
| 27 |
|
| 28 |
-
- [
|
| 29 |
|
| 30 |
---
|
| 31 |
|
| 32 |
## Epic E04. Scientist Agent and Lab Manager Policy
|
| 33 |
|
| 34 |
-
- [
|
| 35 |
-
- [
|
| 36 |
-
- [
|
| 37 |
-
- [
|
| 38 |
-
- [
|
| 39 |
-
- [
|
| 40 |
-
- [
|
|
|
|
| 41 |
- [ ] **AGT 08** | Add prompt formatting and parse tests | 0.75h | Depends: AGT 01 to AGT 04
|
| 42 |
- [ ] **AGT 10** | Write domain-neutral prompt text files for all three roles | 0.75h | Depends: AGT 01, AGT 07, JDG 06
|
| 43 |
-
- [ ] **AGT 11** | Select and document base model for Scientist training | 0.5h | Depends: AGT 01
|
| 44 |
|
| 45 |
---
|
| 46 |
|
|
@@ -82,7 +88,7 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 82 |
|
| 83 |
## Shared Tasks
|
| 84 |
|
| 85 |
-
- [x] **FND 08** | Freeze JSON contract for actions and observations (with Person A) | 0.75h | Depends: FND 04
|
| 86 |
|
| 87 |
---
|
| 88 |
|
|
@@ -91,4 +97,6 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 91 |
| Metric | Value |
|
| 92 |
|--------|-------|
|
| 93 |
| Total tasks | 29 |
|
|
|
|
|
|
|
| 94 |
| Total estimated hours | 21.5h |
|
|
|
|
| 6 |
|
| 7 |
## Current status
|
| 8 |
|
|
|
|
| 9 |
- `FND 08` is complete in `docs/fnd08_frozen_json_contract.md`
|
| 10 |
+
- `MOD 09` is complete in `replicalab/agents/scientist_policy.py`
|
| 11 |
+
- `SCN 11` is complete in `tests/fixtures/golden_scenarios.json`
|
| 12 |
+
- `AGT 01` is complete in `replicalab/agents/scientist_policy.py`
|
| 13 |
+
- `AGT 02` is complete in `replicalab/agents/scientist_policy.py`
|
| 14 |
+
- `AGT 04` is complete in `replicalab/agents/scientist_policy.py`
|
| 15 |
+
- `AGT 05` is complete in `replicalab/agents/lab_manager_policy.py`
|
| 16 |
+
- `AGT 06` is complete in `replicalab/agents/lab_manager_policy.py`
|
| 17 |
+
- `AGT 07` is complete in `replicalab/agents/lab_manager_policy.py`
|
| 18 |
+
- `AGT 11` is complete in `docs/agt11_scientist_model_selection.md`
|
| 19 |
+
- The scenario prerequisite bundle (`SCN 01` to `SCN 10`) is now present in the repo, so Ayush prompt work is backed by real normalized scenario packs instead of placeholders
|
| 20 |
+
- The next fully unblocked Ayush task is `AGT 03`
|
| 21 |
+
- `AGT 03` is now the highest-leverage next step because the formatter and parser are both in place, so the retry loop can complete the Scientist action path end-to-end
|
| 22 |
+
- `AGT 10` now waits only on `JDG 06`
|
| 23 |
|
| 24 |
---
|
| 25 |
|
| 26 |
## Epic E02. Domain Models
|
| 27 |
|
| 28 |
+
- [x] **MOD 09** | Add output parser that maps model text to `ScientistAction` | 0.75h | Depends: MOD 01 | Status: completed on 2026-03-08
|
| 29 |
|
| 30 |
---
|
| 31 |
|
| 32 |
## Epic E03. Scenario Engine
|
| 33 |
|
| 34 |
+
- [x] **SCN 11** | Create hand-checked golden scenarios for prompt testing | 0.75h | Depends: SCN 09 | Status: completed on 2026-03-08
|
| 35 |
|
| 36 |
---
|
| 37 |
|
| 38 |
## Epic E04. Scientist Agent and Lab Manager Policy
|
| 39 |
|
| 40 |
+
- [x] **AGT 01** | Draft domain-neutral system prompt for Scientist role from normalized scenario data | 0.75h | Depends: MOD 01, SCN 11 | Status: completed on 2026-03-08
|
| 41 |
+
- [x] **AGT 02** | Build observation to prompt formatting helper from normalized scenario-derived observations | 0.75h | Depends: AGT 01, MOD 03 | Status: completed on 2026-03-08
|
| 42 |
+
- [x] **AGT 04** | Build baseline heuristic Scientist for non-trained smoke tests | 1h | Depends: AGT 02 | Status: completed on 2026-03-08
|
| 43 |
+
- [x] **AGT 05** | Implement deterministic feasibility checker over normalized constraints and resources (shared with Person A) | 1.25h | Depends: SCN 07, MOD 05 | Status: completed on 2026-03-08
|
| 44 |
+
- [x] **AGT 06** | Implement alternative suggestion logic from allowed substitutions and tradeoffs | 1h | Depends: AGT 05, SCN 08 | Status: completed on 2026-03-08
|
| 45 |
+
- [x] **AGT 07** | Add model-backed Lab Manager response synthesis from checker output | 0.75h | Depends: AGT 05 | Status: completed on 2026-03-08
|
| 46 |
+
- [x] **AGT 11** | Select and document base model for Scientist training | 0.5h | Depends: AGT 01 | Status: completed on 2026-03-08
|
| 47 |
+
- [ ] **AGT 03** | Add parse plus retry strategy for malformed model output | 0.75h | Depends: MOD 09, AGT 02 | Status: ready now
|
| 48 |
- [ ] **AGT 08** | Add prompt formatting and parse tests | 0.75h | Depends: AGT 01 to AGT 04
|
| 49 |
- [ ] **AGT 10** | Write domain-neutral prompt text files for all three roles | 0.75h | Depends: AGT 01, AGT 07, JDG 06
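Since `AGT 03` is the next open item, here is a minimal sketch of a parse-plus-retry loop that satisfies "at least one controlled retry or explicit failure". Everything in it is an assumption for illustration: the `ParseError` type, the `action_type` field, the retry prompt wording, and the `generate` callable standing in for a model call. It is not the parser delivered under `MOD 09`.

```python
import json
from typing import Callable


class ParseError(ValueError):
    """Raised when model text cannot be turned into a structured action."""


def parse_action(text: str) -> dict:
    """Hypothetical parser: require a JSON object with an 'action_type' key."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ParseError(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(data, dict) or "action_type" not in data:
        raise ParseError("missing required field 'action_type'")
    return data


def call_with_retry(generate: Callable[[str], str], prompt: str,
                    max_retries: int = 1) -> dict:
    """Call the model, re-prompting with the error after each bad reply."""
    last_error = None
    current_prompt = prompt
    for _ in range(max_retries + 1):
        try:
            return parse_action(generate(current_prompt))
        except ParseError as exc:
            last_error = exc
            current_prompt = (f"{prompt}\n\nYour previous reply was rejected "
                              f"({exc}). Reply with valid JSON only.")
    # Explicit failure instead of a silent fallback action.
    raise ParseError(f"gave up after {max_retries + 1} attempts: {last_error}")


# A fake model that fails once, then returns valid JSON.
replies = iter(["not json", '{"action_type": "propose_protocol"}'])
action = call_with_retry(lambda _prompt: next(replies), "Propose a plan.")
```

Bounding the retries and raising on exhaustion keeps a rollout loop from hanging on a model that never produces valid output, while still giving the caller a specific error to log.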
|
|
|
|
| 50 |
|
| 51 |
---
|
| 52 |
|
|
|
|
| 88 |
|
| 89 |
## Shared Tasks
|
| 90 |
|
| 91 |
+
- [x] **FND 08** | Freeze JSON contract for actions and observations (with Person A) | 0.75h | Depends: FND 04 | Status: completed and signed off
|
| 92 |
|
| 93 |
---
|
| 94 |
|
|
|
|
| 97 |
| Metric | Value |
|
| 98 |
|--------|-------|
|
| 99 |
| Total tasks | 29 |
|
| 100 |
+
| Completed | 10 |
|
| 101 |
+
| Remaining | 19 |
|
| 102 |
| Total estimated hours | 21.5h |
|
docs/changes.md
CHANGED
|
@@ -23,5 +23,11 @@ Rules:
|
|
| 23 |
| 2026-03-08 | Person B (Ayush) | FND 08 and FND 09 | Recorded Kian-side sign-off for the shared contract and executed `FND 09` even though it was assigned to Person A | The same contributor is currently covering both the Kian and Ayush lanes, and the OpenEnv registration layer needed to be real rather than left as a placeholder | `FND 08` is now complete, `openenv.yaml` exists, and the repo now carries the minimal OpenEnv runtime wiring needed for local validation | The real environment class in `replicalab/env/replicalab_env.py` is still a later task |
|
| 24 |
| 2026-03-08 | Person B (Ayush) | MOD 01 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and the strict `ScientistAction` validator was the highest-leverage unblocker for downstream parser and validation work | `ScientistAction` now enforces the frozen contract, `MOD 09` and `MOD 05` are unblocked, and focused schema tests now exist in `tests/test_models.py` | `MOD 03` is the next schema-critical Kian task |
|
| 25 |
| 2026-03-08 | Person B (Ayush) | MOD 02 and MOD 03 | Executed the tasks even though they were assigned to Person A | The Kian and Ayush lanes are being covered together, and the strict Lab Manager plus typed observation contracts were the fastest way to stabilize the shared schema surface before parser, state, and environment work fan out | `LabManagerAction`, `ConversationEntry`, `Protocol`, and both observation branches now enforce the frozen contract, `MOD 04` and `MOD 11` are unblocked, and the stub server path is verified against the typed models | `MOD 12`, `SCN 01`, and `MOD 05` are the next Kian-lane tasks |
|
| 26 |
| 2026-03-08 | Person B (Ayush) | Architecture roadmap | Shifted the planning docs from lab-first replication toward a normalized multi-domain scenario layer with mathematics and machine learning first, finance and trading planning third, and physics or biology later | The team wants the environment to stay domain-agnostic under a stable outer contract while keeping the reward deterministic and making the Lab Manager stronger for the hackathon story | The source-of-truth backlog, README, and Kian or Ayush planning docs now assume `scenario adapter -> normalized scenario pack -> observation mapper -> stable contracts`, plus a hybrid Lab Manager with deterministic feasibility grounding | `SCN 02`, `SCN 07`, `SCN 08`, `AGT 01`, `AGT 05`, `AGT 07`, and the judge wording must now be implemented to this architecture |
|
|
|
|
| 27 |
|
|
|
|
| 23 |
| 2026-03-08 | Person B (Ayush) | FND 08 and FND 09 | Recorded Kian-side sign-off for the shared contract and executed `FND 09` even though it was assigned to Person A | The same contributor is currently covering both the Kian and Ayush lanes, and the OpenEnv registration layer needed to be real rather than left as a placeholder | `FND 08` is now complete, `openenv.yaml` exists, and the repo now carries the minimal OpenEnv runtime wiring needed for local validation | The real environment class in `replicalab/env/replicalab_env.py` is still a later task |
|
| 24 |
| 2026-03-08 | Person B (Ayush) | MOD 01 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and the strict `ScientistAction` validator was the highest-leverage unblocker for downstream parser and validation work | `ScientistAction` now enforces the frozen contract, `MOD 09` and `MOD 05` are unblocked, and focused schema tests now exist in `tests/test_models.py` | `MOD 03` is the next schema-critical Kian task |
|
| 25 |
| 2026-03-08 | Person B (Ayush) | MOD 02 and MOD 03 | Executed the tasks even though they were assigned to Person A | The Kian and Ayush lanes are being covered together, and the strict Lab Manager plus typed observation contracts were the fastest way to stabilize the shared schema surface before parser, state, and environment work fan out | `LabManagerAction`, `ConversationEntry`, `Protocol`, and both observation branches now enforce the frozen contract, `MOD 04` and `MOD 11` are unblocked, and the stub server path is verified against the typed models | `MOD 12`, `SCN 01`, and `MOD 05` are the next Kian-lane tasks |
|
| 26 |
+
| 2026-03-08 | Person B (Ayush) | MOD 12 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and centralizing shared defaults was the cleanest way to stop config drift before the real environment and scoring modules expand | `replicalab/config.py` now holds shared defaults for scenario selection, difficulty, round cap, budget cap, timeout values, stub reward, and API host or port defaults, and the server plus scenario builders import them instead of repeating literals | `MOD 05`, `MOD 04`, and `MOD 11` remain the next Kian-lane foundation tasks |
|
| 27 |
+
| 2026-03-08 | Person B (Ayush) | MOD 11 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and a typed step-result contract was needed before the environment, API, replay, and training paths grew around loose metadata | `RewardBreakdown`, `StepInfo`, and typed `StepResult.info` now exist, and the stub runtime explicitly constructs those reserved-key payloads while preserving debug metadata | `MOD 04` and `MOD 05` were the remaining Kian-lane foundation tasks after this |
|
| 28 |
+
| 2026-03-08 | Person B (Ayush) | MOD 04 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and state plus replay needed to use the same typed protocol and conversation models already enforced at the action and observation layers | `EpisodeState` and `EpisodeLog` now carry typed `Protocol`, `ConversationEntry`, and `RewardBreakdown` fields, the stub runtime constructs those nested models explicitly, and replay serialization is now aligned with the typed contract | `MOD 07` and `ENV 01` are now unblocked |
|
| 29 |
+
| 2026-03-08 | Person B (Ayush) | MOD 05 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and structural schema validation was not enough to stop impossible or hallucinated plans from reaching the environment | `replicalab/utils/validation.py` now provides deterministic protocol validation against normalized scenario resources, substitutions, time limits, and required elements, returning structured issues instead of relying on ad hoc runtime checks | `MOD 06` and shared `AGT 05` are now unblocked |
|
| 30 |
+
| 2026-03-08 | Person B (Ayush) | SCN 01 to SCN 10 | Executed the full scenario-engine prerequisite bundle even though it was assigned to Person A and originally sequenced after `MOD 04` | `SCN 11` and `AGT 01` needed a real normalized scenario generator rather than another placeholder, and the Kian plus Ayush lanes are being covered together | The repo now has deterministic seeded scenario generation for mathematics, machine learning, and finance-trading planning, plus golden fixtures and seeded scenario tests; `SCN 11`, `AGT 01`, and the stub server scenario list are now backed by the same normalized scenario pack | `MOD 04` still needs to thread the normalized scenario pack through `EpisodeState` and replay models cleanly |
|
| 31 |
| 2026-03-08 | Person B (Ayush) | Architecture roadmap | Shifted the planning docs from lab-first replication toward a normalized multi-domain scenario layer with mathematics and machine learning first, finance and trading planning third, and physics or biology later | The team wants the environment to stay domain-agnostic under a stable outer contract while keeping the reward deterministic and making the Lab Manager stronger for the hackathon story | The source-of-truth backlog, README, and Kian or Ayush planning docs now assume `scenario adapter -> normalized scenario pack -> observation mapper -> stable contracts`, plus a hybrid Lab Manager with deterministic feasibility grounding | `SCN 02`, `SCN 07`, `SCN 08`, `AGT 01`, `AGT 05`, `AGT 07`, and the judge wording must now be implemented to this architecture |
|
| 32 |
+
| 2026-03-08 | Person B (Ayush) | FND 03 and FND 12 | Imported the frontend shell and Vite proxy config from Kush's branch even though both tasks are assigned to Max | The `ayush` integration branch only had the frontend scaffold, and the validated frontend from `origin/Kush` needed to exist on the integration branch for future UI and deployment work | `frontend/` now contains the full React plus Vite app, `frontend/vite.config.ts` is present with API and WebSocket proxy rules, and local validation passed with `npm --prefix frontend install` plus `npm --prefix frontend run build` | `FND 13` and `UI 01` are now unblocked; remaining UI tasks still need explicit review before being marked complete |
|
| 33 |
|
docs/completion.md
CHANGED
|
@@ -20,27 +20,30 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 20 |
| Metric | Value |
|
| 21 |
|--------|-------|
|
| 22 |
| Total tasks | 152 |
|
| 23 |
-
| Completed |
|
| 24 |
| Partial / active | 10 |
|
| 25 |
-
| Remaining |
|
| 26 |
-
| **Completion rate** | **
|
| 27 |
|
| 28 |
### Completion by Person
|
| 29 |
|
| 30 |
| Person | Assigned | Completed (own) | Completed (by others) | Remaining | Rate |
|
| 31 |
|--------|----------|----------------|----------------------|-----------|------|
|
| 32 |
-
| Kian (Person A) | 49 (47 solo + 2 shared with B) | 1 shared sign-off (`FND 08`) |
|
| 33 |
-
| Person B (Ayush) | 29 (27 solo + 2 shared with A) |
|
| 34 |
-
| Max (Person C) | 41 | 1 (`FND 11`) |
|
| 35 |
| Kush (Person D) | 32 | 0 | 1 (`FND 06` done by Person B) | 31 | 3.13% |
|
| 36 |
-
| All (shared) | 3 |
|
| 37 |
-
|
| 38 |
-
Note: Person B (Ayush) has completed
|
| 39 |
-
(`FND 08`
|
| 40 |
-
|
| 41 |
-
`FND
|
| 42 |
-
|
| 43 |
-
|
| 44 |
|
| 45 |
---
|
| 46 |
|
|
@@ -49,10 +52,10 @@ task ready now: `MOD 09`.
|
|
| 49 |
| ID | Assigned To | Current Status | Remaining Acceptance Item |
|
| 50 |
|----|-------------|----------------|---------------------------|
|
| 51 |
| API 01 | Max (Person C) | FastAPI app shell and `/health` endpoint work locally against the stub env | Real env dependency and task-owner sign-off |
|
| 52 |
-
| API 02 | Max (Person C) | `/reset` works locally against the stub env | Real env reset dependency and task-owner sign-off |
|
| 53 |
| API 03 | Max (Person C) | `/step` works locally against the stub env | Real env step dependency and task-owner sign-off |
|
| 54 |
-
| API 04 | Max (Person C) | `/scenarios`
|
| 55 |
-
| API 06 | Max (Person C) | WebSocket reset, ping, and step work locally against the stub env | Real env integration and task-owner sign-off |
|
| 56 |
| API 07 | Max (Person C) | Idle timeout and cleanup logic exist in the WebSocket path | Real env disconnect cleanup verification |
|
| 57 |
| API 08 | Max (Person C) | `server/Dockerfile` exists | Local Docker build and run verification |
|
| 58 |
| API 13 | Max (Person C) | CORS middleware exists for dev and hosted origins | Frontend integration verification |
|
|
@@ -78,6 +81,41 @@ task ready now: `MOD 09`.
|
|
| 78 |
| MOD 01 | E02 | Person A | Implement `ScientistAction` schema | `replicalab/models.py`, `tests/test_models.py`, `server/app.py` | 2026-03-08 | Replaced the `ScientistAction` stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, rejected mixed-mode payloads, added conditional validation for proposal, revision, request-info, and accept modes, and patched the stub server so `accept` preserves the current protocol. | Valid scientist actions parse and invalid fields raise validation errors | Yes - verified with `python -m pytest tests/test_models.py` and a stub-env `ScientistAction.model_validate(...)` smoke step |
|
| 79 |
| MOD 02 | E02 | Person A | Implement `LabManagerAction` schema | `replicalab/models.py`, `tests/test_models.py` | 2026-03-08 | Replaced the `LabManagerAction` stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, enforced feasible-flag consistency, rejected suggestion fields outside `suggest_alternative`, and added focused validation tests. | Valid lab manager actions parse and invalid fields raise validation errors | Yes - verified with `python -m pytest tests/test_models.py` |
|
| 80 |
| MOD 03 | E02 | Person A | Implement role specific observation models | `replicalab/models.py`, `tests/test_models.py`, `server/app.py` | 2026-03-08 | Added typed `ConversationEntry` and `Protocol` models, upgraded both observation branches to use typed nested structures with non-negative numeric constraints and stable keys, and verified dict-to-model coercion through the stub server. | Scientist and lab observations serialize to JSON with stable keys | Yes - verified with `python -m pytest tests/test_models.py` and a stub `reset()` / `step()` JSON smoke test |
|
| 81 |
|
| 82 |
### Shared Tasks - Completed
|
| 83 |
|
|
@@ -105,6 +143,7 @@ task ready now: `MOD 09`.
|
|
| 105 |
|---------------|-------------------|
|
| 106 |
| FND 01 | FND 02, FND 03, FND 04, FND 05, FND 06, FND 07, FND 10 |
|
| 107 |
| FND 02 | FND 11 |
|
|
|
|
| 108 |
| FND 04 | FND 08, FND 09 |
|
| 109 |
| FND 05 | No downstream dependencies |
|
| 110 |
| FND 06 | DOC 01 |
|
|
@@ -113,21 +152,48 @@ task ready now: `MOD 09`.
|
|
| 113 |
| FND 09 | OpenEnv registration layer is now present for later `/web` and deployment work |
|
| 114 |
| FND 10 | No downstream dependencies |
|
| 115 |
| FND 11 | No new formal dependencies, but server scaffold work can now install from a standalone requirements file |
|
|
|
|
| 116 |
| MOD 01 | MOD 05, MOD 09 |
|
| 117 |
| MOD 02 | No new formal dependencies, but the Lab Manager contract is now stable for later policy work |
|
| 118 |
| MOD 03 | MOD 04, MOD 11 |
|
| 119 |
|
| 120 |
### Current Unblocked and Active Tasks
|
| 121 |
|
| 122 |
| ID | Owner | Task | Unblocked By |
|
| 123 |
|----|-------|------|-------------|
|
| 124 |
-
| FND
|
| 125 |
-
|
|
| 126 |
-
|
|
| 127 |
-
| MOD
|
| 128 |
-
| MOD
|
| 129 |
-
|
|
| 130 |
-
|
|
| 131 |
| DOC 01 | Kush (Person D) | Write hook, problem statement, and one line product summary | FND 06 |
|
| 132 |
|
| 133 |
---
|
|
@@ -136,10 +202,10 @@ task ready now: `MOD 09`.
|
|
| 136 |
|
| 137 |
| Epic | Total Tasks | Completed | Rate |
|
| 138 |
|------|------------|-----------|------|
|
| 139 |
-
| E01. Foundations and repository setup | 13 |
|
| 140 |
-
| E02. Domain models, validation, state contracts | 12 |
|
| 141 |
-
| E03. Scenario engine and constraint generation | 13 |
|
| 142 |
-
| E04. Scientist agent and Lab Manager policy | 11 |
|
| 143 |
| E05. Judge engine and reward logic | 11 | 0 | 0% |
|
| 144 |
| E06. OpenEnv environment implementation | 11 | 0 | 0% |
|
| 145 |
| E07. API, server, Docker, deployment | 19 | 0 | 0% |
|
|
|
|
| 20 |
| Metric | Value |
|
| 21 |
|--------|-------|
|
| 22 |
| Total tasks | 152 |
|
| 23 |
+
| Completed | 38 |
|
| 24 |
| Partial / active | 10 |
|
| 25 |
+
| Remaining | 104 |
|
| 26 |
+
| **Completion rate** | **25.00%** |
|
| 27 |
|
| 28 |
### Completion by Person
|
| 29 |
|
| 30 |
| Person | Assigned | Completed (own) | Completed (by others) | Remaining | Rate |
|
| 31 |
|--------|----------|----------------|----------------------|-----------|------|
|
| 32 |
+
| Kian (Person A) | 49 (47 solo + 2 shared with B) | 1 shared sign-off (`FND 08`) | 20 (`FND 04`, `FND 09`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`, `MOD 11`, `MOD 12`, `SCN 01` to `SCN 10`, `AGT 05` done by Person B) | 28 | 42.86% |
|
| 33 |
+
| Person B (Ayush) | 29 (27 solo + 2 shared with A) | 10 (`FND 08`, `MOD 09`, `SCN 11`, `AGT 01`, `AGT 02`, `AGT 04`, `AGT 05`, `AGT 06`, `AGT 07`, `AGT 11`) | 0 | 19 | 34.48% |
|
| 34 |
+
| Max (Person C) | 41 | 1 (`FND 11`) | 7 (`FND 01`, `FND 02`, `FND 03`, `FND 05`, `FND 07`, `FND 10`, `FND 12` done by others) | 33 | 19.51% |
|
| 35 |
| Kush (Person D) | 32 | 0 | 1 (`FND 06` done by Person B) | 31 | 3.13% |
|
| 36 |
+
| All (shared) | 3 | 2 (`FND 08`, `AGT 05`) | 0 | 1 | 66.67% |
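The figures in the per-person table appear consistent with a simple rule: tasks done equal the owner's completions plus completions by others, `Remaining` is assigned minus done, and `Rate` is done divided by assigned, rounded half-up to two decimals. The sketch below re-derives the rows under that assumption; the helper name `completion_rate` is hypothetical and the rule itself is inferred from the numbers rather than stated in the source.

```python
from decimal import Decimal, ROUND_HALF_UP


def completion_rate(done: int, assigned: int) -> str:
    """Return done/assigned as a percentage string, rounded half-up to 2 places."""
    pct = Decimal(done) * 100 / Decimal(assigned)
    return f"{pct.quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)}%"


# (assigned, completed by owner, completed by others), copied from the table rows.
rows = {
    "Kian": (49, 1, 20),
    "Ayush": (29, 10, 0),
    "Max": (41, 1, 7),
    "Kush": (32, 0, 1),
    "All (shared)": (3, 2, 0),
}

for person, (assigned, own, by_others) in rows.items():
    done = own + by_others
    remaining = assigned - done
    print(person, remaining, completion_rate(done, assigned))
```

`Decimal` with `ROUND_HALF_UP` is used because 1/32 is exactly 3.125%, which Python's built-in `round` and float formatting would render as 3.12 rather than the table's 3.13%.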
|
| 37 |
+
|
| 38 |
+
Note: Person B (Ayush) has completed two shared tasks
|
| 39 |
+
(`FND 08`, `AGT 05`) plus eight solo tasks in their own lane (`MOD 09`,
|
| 40 |
+
`SCN 11`, `AGT 01`, `AGT 02`, `AGT 04`, `AGT 06`, `AGT 07`, `AGT 11`), and has also executed twenty-five tasks outside their assigned
|
| 41 |
+
ownership (`FND 01`, `FND 02`, `FND 04`, `FND 05`, `FND 06`, `FND 07`,
|
| 42 |
+
`FND 09`, `FND 10`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`,
|
| 43 |
+
`MOD 11`, `MOD 12`, `SCN 01` to `SCN 10`) to keep the Kian, Max, and Kush
|
| 44 |
+
dependency chain moving. Ayush now has one fully unblocked implementation
|
| 45 |
+
task available: `AGT 03`, with `AGT 10` reduced to a single remaining
|
| 46 |
+
external dependency on `JDG 06`.
|
| 47 |
|
| 48 |
---
|
| 49 |
|
|
|
|
| 52 |
| ID | Assigned To | Current Status | Remaining Acceptance Item |
|
| 53 |
|----|-------------|----------------|---------------------------|
|
| 54 |
| API 01 | Max (Person C) | FastAPI app shell and `/health` endpoint work locally against the stub env | Real env dependency and task-owner sign-off |
|
| 55 |
+
| API 02 | Max (Person C) | `/reset` works locally against the stub env and now seeds normalized math / ML / finance scenarios through the shared generator | Real env reset dependency and task-owner sign-off |
|
| 56 |
| API 03 | Max (Person C) | `/step` works locally against the stub env | Real env step dependency and task-owner sign-off |
|
| 57 |
+
| API 04 | Max (Person C) | `/scenarios` returns the normalized scenario-family list from the shared generator | Real env exposure and task-owner sign-off |
|
| 58 |
+
| API 06 | Max (Person C) | WebSocket reset, ping, and step work locally against the stub env, including normalized scenario-family resets | Real env integration and task-owner sign-off |
|
| 59 |
| API 07 | Max (Person C) | Idle timeout and cleanup logic exist in the WebSocket path | Real env disconnect cleanup verification |
|
| 60 |
| API 08 | Max (Person C) | `server/Dockerfile` exists | Local Docker build and run verification |
|
| 61 |
| API 13 | Max (Person C) | CORS middleware exists for dev and hosted origins | Frontend integration verification |
|
|
|
|
| 81 |
| MOD 01 | E02 | Person A | Implement `ScientistAction` schema | `replicalab/models.py`, `tests/test_models.py`, `server/app.py` | 2026-03-08 | Replaced the `ScientistAction` stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, rejected mixed-mode payloads, added conditional validation for proposal, revision, request-info, and accept modes, and patched the stub server so `accept` preserves the current protocol. | Valid scientist actions parse and invalid fields raise validation errors | Yes - verified with `python -m pytest tests/test_models.py` and a stub-env `ScientistAction.model_validate(...)` smoke step |
|
| 82 |
| MOD 02 | E02 | Person A | Implement `LabManagerAction` schema | `replicalab/models.py`, `tests/test_models.py` | 2026-03-08 | Replaced the `LabManagerAction` stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, enforced feasible-flag consistency, rejected suggestion fields outside `suggest_alternative`, and added focused validation tests. | Valid lab manager actions parse and invalid fields raise validation errors | Yes - verified with `python -m pytest tests/test_models.py` |
|
| 83 |
| MOD 03 | E02 | Person A | Implement role specific observation models | `replicalab/models.py`, `tests/test_models.py`, `server/app.py` | 2026-03-08 | Added typed `ConversationEntry` and `Protocol` models, upgraded both observation branches to use typed nested structures with non-negative numeric constraints and stable keys, and verified dict-to-model coercion through the stub server. | Scientist and lab observations serialize to JSON with stable keys | Yes - verified with `python -m pytest tests/test_models.py` and a stub `reset()` / `step()` JSON smoke test |
|
| 84 |
+
| MOD 04 | E02 | Person A | Implement `EpisodeState` and `EpisodeLog` models | `replicalab/models.py`, `server/app.py`, `tests/test_models.py` | 2026-03-08 | Replaced the remaining loose `dict` state and replay fields with typed `Protocol`, `ConversationEntry`, and `RewardBreakdown` models, updated the stub runtime to construct those nested models explicitly, and added round-trip coverage for serialized state and logs. | Full state round trip serialize plus deserialize works | Yes - verified with `python -m pytest tests/test_models.py` |
|
| 85 |
+
| MOD 05 | E02 | Person A | Add protocol validation for sample size, controls, duration, equipment vocab, and reagent vocab | `replicalab/utils/validation.py`, `tests/test_models.py`, `tests/test_scenarios.py` | 2026-03-08 | Added deterministic semantic protocol validation with `ValidationResult` and `validate_protocol(...)` checks for resource vocabulary, allowed substitutions, duration limits, required-element coverage, and obvious impossibilities against the normalized scenario pack. | Invalid protocol examples are rejected with readable reasons | Yes - verified with `python -m pytest tests/test_models.py tests/test_scenarios.py` |
|
| 86 |
+
| MOD 11 | E02 | Person A | Implement `StepResult` model | `replicalab/models.py`, `server/app.py`, `tests/test_models.py` | 2026-03-08 | Added typed `RewardBreakdown` and `StepInfo` models, upgraded `StepResult.info` to the reserved-key contract while still allowing debug metadata, and updated the stub runtime to build typed reward and step-info payloads explicitly. | Step result serializes cleanly and all consumers agree on its shape | Yes - verified with `python -m pytest tests/test_models.py` |
|
| 87 |
+
| MOD 12 | E02 | Person A | Create environment configuration module with shared constants | `replicalab/config.py`, `server/app.py`, `replicalab/scenarios/*.py`, `tests/test_config.py` | 2026-03-08 | Added a shared configuration module for default scenario and difficulty, round cap, budget cap, timeout values, stub reward, and API host or port defaults, then updated the server and scenario builders to import those constants instead of repeating literals. | All modules import config from one place and no magic numbers remain in env or scoring code | Yes - verified with `python -m pytest tests/test_config.py tests/test_scenarios.py` |
|
| 88 |
+
| SCN 01 | E03 | Person A | Implement deterministic RNG helper `seed_rng()` | `replicalab/utils/seed.py`, `replicalab/scenarios/templates.py` | 2026-03-08 | Added deterministic seed helpers that derive reproducible RNG namespaces for scenario generation. | Same seed always yields the same random choices and the seed utility is importable from scenarios and env | Yes - verified with `python -m pytest tests/test_scenarios.py` |
|
| 89 |
+
| SCN 02 | E03 | Person A | Define normalized scenario schema with task summary, success criteria, constraints, resources, allowed substitutions, and hidden reference spec | `replicalab/scenarios/templates.py` | 2026-03-08 | Added `NormalizedScenarioPack` plus strict `ScenarioConstraint`, `ScenarioResource`, `AllowedSubstitution`, and `HiddenReferenceSpec` models to standardize all scenario families. | All scenario builders return the same normalized top-level structure and mapper-ready inputs | Yes - verified with `python -m pytest tests/test_scenarios.py` |
|
| 90 |
+
| SCN 03 | E03 | Person A | Implement mathematics template | `replicalab/scenarios/math_reasoning.py` | 2026-03-08 | Added deterministic mathematics planning templates covering theorem, proof-goal, review, and time constraints. | Generated scenario passes structure and internal consistency tests | Yes - verified with `python -m pytest tests/test_scenarios.py` |
|
| 91 |
+
| SCN 04 | E03 | Person A | Implement ML benchmark template | `replicalab/scenarios/ml_benchmark.py` | 2026-03-08 | Added deterministic ML benchmark templates covering dataset, compute, time, and evaluation constraints. | Generated scenario passes structure and internal consistency tests | Yes - verified with `python -m pytest tests/test_scenarios.py` |
|
| 92 |
+
| SCN 05 | E03 | Person A | Implement finance and trading planning template | `replicalab/scenarios/finance_trading.py` | 2026-03-08 | Added deterministic offline finance and trading planning templates covering capital, drawdown, slippage, and backtest rules. | Generated scenario passes structure and internal consistency tests | Yes - verified with `python -m pytest tests/test_scenarios.py` |
|
| 93 |
+
| SCN 06 | E03 | Person A | Implement difficulty application for easy, medium, hard | `replicalab/scenarios/templates.py`, `tests/test_scenarios.py` | 2026-03-08 | Added mechanical difficulty scaling that adjusts budgets, time, staff, resource availability, and injected conflict constraints across easy, medium, and hard. | Difficulty visibly changes the normalized scenario pack in a meaningful way | Yes - verified with `python -m pytest tests/test_scenarios.py` |
|
| 94 |
+
| SCN 07 | E03 | Person A | Implement normalized constraint and resource generator | `replicalab/scenarios/templates.py`, `tests/test_scenarios.py` | 2026-03-08 | Added normalized constraint and resource mapping into role-specific observations with consistency checks for unique keys and non-contradictory generated packs. | No generated scenario contains contradictory constraints or resources | Yes - verified with `python -m pytest tests/test_scenarios.py` |
|
| 95 |
+
| SCN 08 | E03 | Person A | Implement hidden reference spec and allowed substitutions per template | `replicalab/scenarios/templates.py`, `tests/test_scenarios.py` | 2026-03-08 | Added per-template hidden reference specs and allowed substitutions so scoring and negotiation can distinguish fixed versus flexible elements deterministically. | Hidden reference clearly marks what is fixed versus flexible for deterministic scoring | Yes - verified with `python -m pytest tests/test_scenarios.py` |
|
| 96 |
+
| SCN 09 | E03 | Person A | Implement `generate_scenario(seed, template, difficulty)` | `replicalab/scenarios/templates.py`, `server/app.py`, `tests/test_scenarios.py` | 2026-03-08 | Added deterministic full-scenario generation and wired the stub server to use the normalized scenario families instead of the earlier hard-coded lab-only placeholder list. | Function returns a full scenario with deterministic content | Yes - verified with `python -m pytest tests/test_scenarios.py` and a `_StubEnv.reset(...)` smoke test |
|
| 97 |
+
| SCN 10 | E03 | Person A | Add seeded generation tests and consistency tests | `tests/test_scenarios.py` | 2026-03-08 | Added seeded determinism, variation, difficulty, consistency, and family-list tests for the normalized scenario engine. | Same seed plus template returns the same scenario and different seeds vary | Yes - verified with `python -m pytest tests/test_scenarios.py` |
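The seeded-determinism property described for `SCN 01` and `SCN 10` (same seed plus template gives the same scenario, different seeds vary) can be illustrated with a namespaced RNG. This is a minimal sketch under assumptions, not the repository's `seed_rng()`: the names `derive_rng` and `pick_constraints`, the SHA-256 derivation, and the constraint pool are all hypothetical.

```python
import hashlib
import random


def derive_rng(seed: int, namespace: str) -> random.Random:
    """Hypothetical helper: one independent, reproducible RNG per namespace."""
    # Hash seed and namespace together so each namespace gets its own stream.
    digest = hashlib.sha256(f"{seed}:{namespace}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def pick_constraints(seed: int, template: str) -> list[str]:
    """Draw three constraint names deterministically for a seed and template."""
    rng = derive_rng(seed, f"{template}/constraints")
    pool = ["budget", "time", "staff", "compute", "review"]  # illustrative pool
    return rng.sample(pool, k=3)


# Repeating a call with the same inputs reproduces the same draw.
assert pick_constraints(7, "math") == pick_constraints(7, "math")
```

Deriving a separate stream per namespace means adding a new random choice in one part of a scenario builder does not shift the draws made elsewhere, which helps keep golden fixtures stable.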
|
| 98 |
+
|
| 99 |
+
### Person B (Ayush) - Completed own tasks
|
| 100 |
+
|
| 101 |
+
| ID | Epic | Task | File/Module | Date | What Was Done | Acceptance Criteria | Verified |
|
| 102 |
+
|----|------|------|-------------|------|---------------|--------------------|---------|
|
| 103 |
+
| MOD 09 | E02 | Add output parser that maps model text to `ScientistAction` | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added a raw-text parser that extracts JSON from plain output, fenced blocks, or prose-wrapped objects, validates it into `ScientistAction`, and raises explicit `ScientistOutputParseError` values for missing JSON, invalid JSON, or schema failures. | Parser returns structured action or explicit parse error | Yes - verified with `python -m pytest tests/test_scientist_policy.py tests/test_models.py` and a direct `parse_scientist_output(...)` smoke check |
|
| 104 |
+
| SCN 11 | E03 | Create hand checked golden scenarios for prompt testing | `tests/fixtures/golden_scenarios.json`, `tests/test_scenarios.py` | 2026-03-08 | Added three deterministic golden scenarios for math, ML, and finance prompt checks plus fixture-validation tests. | Three fixed scenarios are available for deterministic manual testing | Yes - verified with `python -m pytest tests/test_scenarios.py` |
|
| 105 |
+
| AGT 01 | E04 | Draft domain-neutral system prompt for Scientist role from normalized scenario data | `replicalab/agents/scientist_policy.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `build_scientist_system_prompt(...)` to render role guidance, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON contract from normalized scenario data. | Prompt clearly explains role, mapped constraints, and JSON output contract | Yes - verified with `python -m pytest tests/test_scientist_policy.py` and a direct prompt-build smoke check |
|
| 106 |
+
| AGT 02 | E04 | Build observation to prompt formatting helper from normalized scenario-derived observations | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `format_scientist_observation(...)` to render round status, paper context, conversation history, current protocol, and the next-action instruction in a fixed deterministic order, and exported it through the agent package. | Formatted prompt includes task info, history, and action schema consistently | Yes - verified with `python -m pytest tests/test_scientist_policy.py` |
|
| 107 |
+
| AGT 04 | E04 | Build baseline heuristic Scientist for non trained smoke tests | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `build_baseline_scientist_action(...)`, a deterministic non-LLM Scientist policy that proposes a protocol on the first turn, revises only when the latest Lab Manager feedback contains an obvious blocker, and otherwise accepts the current protocol so smoke episodes can finish cleanly. | Baseline can complete episodes without crashing | Yes - verified with `python -m pytest tests/test_scientist_policy.py` including a stub-env episode smoke test |
|
| 108 |
+
| AGT 05 | E04 | Implement deterministic feasibility checker over normalized constraints and resources | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added a deterministic Lab Manager feasibility checker with a typed `FeasibilityCheckResult`, explicit per-dimension protocol, budget, equipment, reagents, schedule, staff, and policy checks, substitution reporting, and stable summary output. | Checker returns clear pass or fail per constraint dimension | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py tests/test_validation.py tests/test_scientist_policy.py` |
|
| 109 |
+
| AGT 06 | E04 | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added deterministic alternative-suggestion logic that applies substitutions, duration clamps, and sample-size reductions in fixed order, re-runs feasibility after the revision, and returns a typed `AlternativeSuggestion` with applied changes, remaining failures, and pre or post feasibility checks. | Lab Manager can suggest at least one sensible revision when the initial plan fails | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py` |
|
| 110 |
+
| AGT 07 | E04 | Add grounded Lab Manager response synthesis from feasibility results and suggested revisions | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `server/app.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added `compose_lab_manager_response(...)`, a deterministic outward-action composer that converts feasibility plus alternative-suggestion results into a typed `LabManagerAction` with stable flags, readable explanations, and optional injected explanation rendering, then wired the stub server to log those grounded responses instead of placeholder text. | Output is readable, grounded in checker results, and maps cleanly to underlying checks | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py` and a stub-env step smoke check |
|
| 111 |
+
| AGT 11 | E04 | Select and document base model for Scientist training | `docs/agt11_scientist_model_selection.md`, `README.md` | 2026-03-08 | Recorded `Qwen3-4B` as the primary Scientist training model with `Qwen3-8B` as the H100-only stretch fallback, and surfaced the decision in the README so the training path uses one canonical model choice. | Decision is recorded and all team members know which model will be fine tuned | Yes - verified by the decision record and README update |
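The AGT 07 entry above describes a composer that picks one outward action from the feasibility result and the optional suggestion. A minimal, self-contained sketch of that selection order is given here. It uses stand-in types rather than the repo's `FeasibilityCheckResult` and `LabManagerActionType`, the enum string values are hypothetical, and it assumes that "lab constraints" means the budget, equipment, reagents, schedule, and staff checks all pass, which this section does not confirm.

```python
from dataclasses import dataclass
from enum import Enum


class ActionType(Enum):
    """Stand-in for LabManagerActionType; the string values are hypothetical."""
    ACCEPT = "accept"
    SUGGEST_ALTERNATIVE = "suggest_alternative"
    REPORT_FEASIBILITY = "report_feasibility"
    REJECT = "reject"


@dataclass(frozen=True)
class Check:
    """Stand-in for FeasibilityCheckResult with one boolean per dimension."""
    budget_ok: bool = True
    equipment_ok: bool = True
    reagents_ok: bool = True
    schedule_ok: bool = True
    staff_ok: bool = True
    protocol_ok: bool = True
    policy_ok: bool = True


def lab_constraints_ok(check: Check) -> bool:
    # Assumption: the five resource dimensions together decide lab feasibility.
    return all(
        (check.budget_ok, check.equipment_ok, check.reagents_ok,
         check.schedule_ok, check.staff_ok)
    )


def select_action(check: Check, applied_changes: list[str]) -> ActionType:
    """Pick the action type in a fixed, deterministic priority order."""
    lab_ok = lab_constraints_ok(check)
    if lab_ok and check.protocol_ok and check.policy_ok:
        return ActionType.ACCEPT
    if applied_changes:  # a revision exists that actually changed something
        return ActionType.SUGGEST_ALTERNATIVE
    if lab_ok:  # resources are fine; protocol or policy still needs work
        return ActionType.REPORT_FEASIBILITY
    return ActionType.REJECT


if __name__ == "__main__":
    print(select_action(Check(), []))
    print(select_action(Check(budget_ok=False), ["reduce sample size"]))
    print(select_action(Check(policy_ok=False), []))
    print(select_action(Check(budget_ok=False), []))
```

Because the order is fixed, an injected explanation renderer can only change the wording of the response, never which branch is taken.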
|
| 112 |
+
|
| 113 |
+
### Kush (Person D) - Completed on behalf of others
|
| 114 |
+
|
| 115 |
+
| ID | Epic | Assigned To | Task | File/Module | Date | What Was Done | Acceptance Criteria | Verified |
|
| 116 |
+
|----|------|------------|------|-------------|------|---------------|--------------------|---------|
|
| 117 |
+
| FND 03 | E01 | Max (Person C) | Initialize React plus Vite frontend shell | `frontend/package.json`, `frontend/src/`, `frontend/public/` | 2026-03-08 | Imported the full React plus Vite frontend tree from Kush's branch onto `ayush`, including the app shell, pages, component library, assets, and TypeScript config. | `npm install` and dev server run successfully | Yes - verified with `npm --prefix frontend install` and `npm --prefix frontend run build` |
|
| 118 |
+
| FND 12 | E01 | Max (Person C) | Create Vite config with API and WebSocket proxy support plus stable build output settings | `frontend/vite.config.ts` | 2026-03-08 | Imported Kush's Vite configuration with `@` alias plus `/api` and `/ws` proxy rules, then verified the frontend builds successfully against that config on `ayush`. | Frontend dev server can reach backend without manual URL edits and build output is predictable for Docker packaging | Yes - verified with `npm --prefix frontend run build` |
|
| 119 |
|
| 120 |
### Shared Tasks - Completed
|
| 121 |
|
|
|
|
| 143 |
|---------------|-------------------|
|
| 144 |
| FND 01 | FND 02, FND 03, FND 04, FND 05, FND 06, FND 07, FND 10 |
|
| 145 |
| FND 02 | FND 11 |
|
| 146 |
+
| FND 03 | FND 12, FND 13, UI 01 |
|
| 147 |
| FND 04 | FND 08, FND 09 |
|
| 148 |
| FND 05 | No downstream dependencies |
|
| 149 |
| FND 06 | DOC 01 |
|
|
|
|
| 152 |
| FND 09 | OpenEnv registration layer is now present for later `/web` and deployment work |
|
| 153 |
| FND 10 | No downstream dependencies |
|
| 154 |
| FND 11 | No new formal dependencies, but server scaffold work can now install from a standalone requirements file |
|
| 155 |
+
| FND 12 | Frontend dev proxying is now configured for local API and WebSocket work |
|
| 156 |
| MOD 01 | MOD 05, MOD 09 |
|
| 157 |
| MOD 02 | No new formal dependencies, but the Lab Manager contract is now stable for later policy work |
|
| 158 |
| MOD 03 | MOD 04, MOD 11 |
|
| 159 |
+
| MOD 04 | MOD 07, ENV 01 |
|
| 160 |
+
| MOD 05 | MOD 06, AGT 05 |
|
| 161 |
+
| MOD 11 | No new formal dependency edge by itself, but `StepResult` metadata is now stable for environment, API, replay, and training consumers |
|
| 162 |
+
| MOD 12 | Shared defaults now come from `replicalab/config.py`, reducing config drift before environment and scoring work expands |
|
| 163 |
+
| SCN 01 | SCN 09 now has a deterministic seed utility to build on |
|
| 164 |
+
| SCN 02 | SCN 03, SCN 04, SCN 05, SCN 07 |
|
| 165 |
+
| SCN 03 | SCN 06, SCN 08 |
|
| 166 |
+
| SCN 04 | SCN 06, SCN 08 |
|
| 167 |
+
| SCN 05 | SCN 06, SCN 08 |
|
| 168 |
+
| SCN 06 | Harder scenario variants and curriculum-ready difficulty scaling now exist |
|
| 169 |
+
| SCN 07 | `AGT 05` is complete; `AGT 06`, `AGT 07`, `JDG 02`, and `SCN 13` are now unblocked from the normalized resource layer |
|
| 170 |
+
| SCN 08 | `AGT 06` is now unblocked; `JDG 01` and `JDG 03` are also unblocked |
|
| 171 |
+
| SCN 09 | SCN 10, SCN 11, ENV 01, ENV 02 |
|
| 172 |
+
| SCN 10 | Scenario determinism and consistency now have regression coverage |
|
| 173 |
+
| SCN 11 | AGT 01, TRN 08 |
|
| 174 |
+
| MOD 09 | Together with completed `AGT 02`, `AGT 03` is now unblocked |
|
| 175 |
+
| AGT 01 | AGT 02, AGT 11, TRN 04 |
|
| 176 |
+
| AGT 02 | AGT 03, AGT 04 |
|
| 177 |
+
| AGT 04 | Removes the last baseline-policy blocker; `AGT 08` now only waits on `AGT 03` |
|
| 178 |
+
| AGT 05 | AGT 06, AGT 07, JDG 02 |
|
| 179 |
+
| AGT 06 | No new formal dependency edge by itself, but `AGT 07` now has deterministic revision content to narrate and compare against |
|
| 180 |
+
| AGT 07 | `AGT 10` now only waits on `JDG 06`, and the stub server now emits grounded Lab Manager responses instead of placeholder review text |
|
| 181 |
+
| AGT 11 | No new formal dependency edge by itself, but the Scientist training model choice is now fixed across repo docs |
|
| 182 |
|
| 183 |
### Current Unblocked and Active Tasks
|
| 184 |
|
| 185 |
| ID | Owner | Task | Unblocked By |
|
| 186 |
|----|-------|------|-------------|
|
| 187 |
+
| FND 13 | Kush (Person D) | Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles | FND 03 |
|
| 188 |
+
| UI 01 | Kush (Person D) | Create application shell with three-panel layout | FND 03 |
|
| 189 |
+
| AGT 03 | Person B (Ayush) | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 |
|
| 190 |
+
| MOD 06 | Kian (Person A) | Add semantic validators for impossible plans such as zero sample size with positive controls | MOD 05 |
|
| 191 |
+
| MOD 07 | Max (Person C) | Add state serialization helper for replay logs | MOD 04 |
|
| 192 |
+
| JDG 01 | Kian (Person A) | Implement rigor or objective-validity score for plan completeness, required checks, method quality, and justification | SCN 08 |
|
| 193 |
+
| JDG 02 | Kian (Person A) | Implement feasibility score for budget, resources, time, staffing, compute, and bookings | SCN 07, AGT 05 |
|
| 194 |
+
| JDG 03 | Kian (Person A) | Implement fidelity score against hidden reference spec, required steps, and allowed substitutions | SCN 08 |
|
| 195 |
+
| SCN 13 | Kian (Person A) | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time-slot conflicts and duration | SCN 07 |
|
| 196 |
+
| ENV 01 | Kian (Person A) | Create `ReplicaLabEnv` class skeleton | MOD 04, SCN 09 |
|
| 197 |
| DOC 01 | Kush (Person D) | Write hook, problem statement, and one-line product summary | FND 06 |
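The "Unblocked By" column above follows one rule: a task is unblocked once every task it depends on is complete. A minimal sketch of that rule, assuming a plain dictionary of dependency edges. The `unblocked_tasks` helper and the data layout are hypothetical and not part of the repo; the sample edges are copied from rows in this document and form an illustrative subset only.

```python
from __future__ import annotations


def unblocked_tasks(depends_on: dict[str, set[str]], completed: set[str]) -> list[str]:
    """Return tasks that are not yet complete but whose dependencies all are."""
    return sorted(
        task
        for task, deps in depends_on.items()
        if task not in completed and deps <= completed
    )


# Dependency edges copied from the tables above (illustrative subset).
# AGT 08's full dependency list is not shown here; only the two tasks
# mentioned for it in the AGT 04 row are included.
DEPENDS_ON = {
    "JDG 01": {"SCN 08"},
    "JDG 02": {"SCN 07", "AGT 05"},
    "JDG 03": {"SCN 08"},
    "SCN 13": {"SCN 07"},
    "ENV 01": {"MOD 04", "SCN 09"},
    "AGT 08": {"AGT 03", "AGT 04"},
}
COMPLETED = {"SCN 07", "SCN 08", "SCN 09", "MOD 04", "AGT 04", "AGT 05"}

if __name__ == "__main__":
    # AGT 08 stays blocked because AGT 03 is not in COMPLETED.
    print(unblocked_tasks(DEPENDS_ON, COMPLETED))
```

Keeping the edges in one mapping would let the per-person task lists be regenerated from a single source instead of being edited by hand in several files.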
|
| 198 |
|
| 199 |
---
|
|
|
|
| 202 |
|
| 203 |
| Epic | Total Tasks | Completed | Rate |
|
| 204 |
|------|------------|-----------|------|
|
| 205 |
+
| E01. Foundations and repository setup | 13 | 12 | 92.31% |
|
| 206 |
+
| E02. Domain models, validation, state contracts | 12 | 8 | 66.67% |
|
| 207 |
+
| E03. Scenario engine and constraint generation | 13 | 11 | 84.62% |
|
| 208 |
+
| E04. Scientist agent and Lab Manager policy | 11 | 7 | 63.64% |
|
| 209 |
| E05. Judge engine and reward logic | 11 | 0 | 0% |
|
| 210 |
| E06. OpenEnv environment implementation | 11 | 0 | 0% |
|
| 211 |
| E07. API, server, Docker, deployment | 19 | 0 | 0% |
|
docs/kian/task_breakdown.md
CHANGED
|
@@ -6,32 +6,28 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 6 |
|
| 7 |
## Current status
|
| 8 |
|
| 9 |
-
- `FND 04` complete
|
| 10 |
-
- `
|
| 11 |
-
- `
|
| 12 |
-
-
|
| 13 |
-
-
|
| 14 |
-
- `MOD 03` complete
|
| 15 |
-
- The Kian lane should now move to config, scenario seeding, validation, the remaining state-model pass, and the normalized scenario layer
|
| 16 |
|
| 17 |
---
|
| 18 |
|
| 19 |
## Recommended execution order
|
| 20 |
|
| 21 |
-
1. `MOD
|
| 22 |
-
2. `SCN
|
| 23 |
-
3. `
|
| 24 |
-
4. `
|
| 25 |
-
5. `
|
| 26 |
-
6. `MOD 11` -- finalizes `StepResult` now that the observation wrapper is typed
|
| 27 |
|
| 28 |
---
|
| 29 |
|
| 30 |
## Why this order
|
| 31 |
|
| 32 |
-
- `MOD
|
| 33 |
-
- `
|
| 34 |
-
- `
|
| 35 |
-
- `
|
| 36 |
-
-
|
| 37 |
-
|
|
|
|
| 6 |
|
| 7 |
## Current status
|
| 8 |
|
| 9 |
+
- `FND 04`, `FND 08`, `FND 09`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`, `MOD 11`, and `MOD 12` are complete
|
| 10 |
+
- Shared `AGT 05` is now complete, so the deterministic feasibility layer exists for both the Lab Manager path and the later judge feasibility score
|
| 11 |
+
- `SCN 01` to `SCN 10` are also complete, so the deterministic scenario layer now exists in code
|
| 12 |
+
- The Kian lane no longer needs to start with scenario seeding or template scaffolding
|
| 13 |
+
- The remaining high-leverage work is semantic edge-case validation, booking conflicts, judge logic, and the real environment
|
|
|
|
|
|
|
| 14 |
|
| 15 |
---
|
| 16 |
|
| 17 |
## Recommended execution order
|
| 18 |
|
| 19 |
+
1. `MOD 06` -- extend the new semantic validation layer to catch impossible edge cases early
|
| 20 |
+
2. `SCN 13` -- deepen the normalized scenario layer with booking and scheduling conflicts
|
| 21 |
+
3. `JDG 01`, `JDG 02`, and `JDG 03` -- start the deterministic reward components that are now unblocked
|
| 22 |
+
4. `JDG 04` and `JDG 05` -- complete the reward pipeline once the component scorers exist
|
| 23 |
+
5. `ENV 01` and `ENV 02` -- once typed state and core scoring pieces are in place, start the real OpenEnv environment path
|
|
|
|
| 24 |
|
| 25 |
---
|
| 26 |
|
| 27 |
## Why this order
|
| 28 |
|
| 29 |
+
- `MOD 06` is the smallest remaining contract-hardening task and builds directly on the completed `MOD 05` validator.
|
| 30 |
+
- `SCN 13` is the remaining scenario-layer depth task; it builds naturally on the completed normalized resource model.
|
| 31 |
+
- `JDG 01` and `JDG 03` can start immediately because their only formal prerequisite, `SCN 08`, is already complete.
|
| 32 |
+
- `JDG 02` is now also unblocked because the deterministic feasibility checker from `AGT 05` exists.
|
| 33 |
+
- The environment path can now start from typed state and step-result contracts instead of loose dict-based placeholders.
|
|
|
docs/kian/task_list.md
CHANGED
|
@@ -6,29 +6,28 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 6 |
|
| 7 |
## Current status
|
| 8 |
|
| 9 |
-
- `FND 04`
|
| 10 |
-
- `
|
| 11 |
-
- `
|
| 12 |
-
- `
|
| 13 |
-
-
|
| 14 |
-
- `MOD
|
| 15 |
-
-
|
| 16 |
-
- `SCN 02` now needs to formalize the normalized scenario pack below the stable outer contract
|
| 17 |
|
| 18 |
---
|
| 19 |
|
| 20 |
## Immediate next tasks
|
| 21 |
|
| 22 |
-
- [ ] **MOD
|
| 23 |
-
- [ ] **SCN
|
| 24 |
-
- [ ] **
|
| 25 |
-
- [ ] **
|
| 26 |
-
- [ ] **
|
| 27 |
-
- [ ] **
|
| 28 |
|
| 29 |
---
|
| 30 |
|
| 31 |
-
## Foundation tasks already landed
|
| 32 |
|
| 33 |
- [x] **FND 04** | Completed by Person B (Ayush)
|
| 34 |
- [x] **FND 08** | Completed with shared sign-off
|
|
@@ -36,4 +35,18 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 36 |
- [x] **MOD 01** | Completed by Person B (Ayush)
|
| 37 |
- [x] **MOD 02** | Completed by Person B (Ayush)
|
| 38 |
- [x] **MOD 03** | Completed by Person B (Ayush)
|
| 39 |
-
|
| 6 |
|
| 7 |
## Current status
|
| 8 |
|
| 9 |
+
- `FND 04`, `FND 08`, and `FND 09` are complete
|
| 10 |
+
- `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`, `MOD 11`, and `MOD 12` are complete
|
| 11 |
+
- Shared `AGT 05` is now complete through Ayush's implementation of the deterministic feasibility checker
|
| 12 |
+
- `SCN 01` to `SCN 10` are now complete in the repo
|
| 13 |
+
- The normalized scenario pack, seeded generation, difficulty scaling, and three initial domain families are already present
|
| 14 |
+
- The next Kian-lane tasks are now `MOD 06`, `SCN 13`, `JDG 01`, `JDG 02`, `JDG 03`, and `ENV 01`
|
| 15 |
+
- `MOD 05` and shared `AGT 05` now exist, so the judge and environment path can build on real scenario-grounded checks instead of placeholder rules
|
|
|
|
| 16 |
|
| 17 |
---
|
| 18 |
|
| 19 |
## Immediate next tasks
|
| 20 |
|
| 21 |
+
- [ ] **MOD 06** | Add semantic validators for impossible plans such as zero sample size with positive controls | 0.75h | Depends: MOD 05
|
| 22 |
+
- [ ] **SCN 13** | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time-slot conflicts and duration | 1h | Depends: SCN 07
|
| 23 |
+
- [ ] **JDG 01** | Implement rigor or objective-validity score for plan completeness, required checks, method quality, and justification | 1.25h | Depends: SCN 08
|
| 24 |
+
- [ ] **JDG 02** | Implement feasibility score for budget, resources, time, staffing, compute, and bookings | 1.25h | Depends: SCN 07, AGT 05
|
| 25 |
+
- [ ] **JDG 03** | Implement fidelity score against hidden reference spec, required steps, and allowed substitutions | 1h | Depends: SCN 08
|
| 26 |
+
- [ ] **ENV 01** | Create `ReplicaLabEnv` class skeleton | 0.5h | Depends: MOD 04, SCN 09
|
| 27 |
|
| 28 |
---
|
| 29 |
|
| 30 |
+
## Foundation and scenario tasks already landed
|
| 31 |
|
| 32 |
- [x] **FND 04** | Completed by Person B (Ayush)
|
| 33 |
- [x] **FND 08** | Completed with shared sign-off
|
|
|
|
| 35 |
- [x] **MOD 01** | Completed by Person B (Ayush)
|
| 36 |
- [x] **MOD 02** | Completed by Person B (Ayush)
|
| 37 |
- [x] **MOD 03** | Completed by Person B (Ayush)
|
| 38 |
+
- [x] **MOD 04** | Completed by Person B (Ayush)
|
| 39 |
+
- [x] **MOD 05** | Completed by Person B (Ayush)
|
| 40 |
+
- [x] **MOD 11** | Completed by Person B (Ayush)
|
| 41 |
+
- [x] **MOD 12** | Completed by Person B (Ayush)
|
| 42 |
+
- [x] **AGT 05** | Completed by Person B (Ayush)
|
| 43 |
+
- [x] **SCN 01** | Completed by Person B (Ayush)
|
| 44 |
+
- [x] **SCN 02** | Completed by Person B (Ayush)
|
| 45 |
+
- [x] **SCN 03** | Completed by Person B (Ayush)
|
| 46 |
+
- [x] **SCN 04** | Completed by Person B (Ayush)
|
| 47 |
+
- [x] **SCN 05** | Completed by Person B (Ayush)
|
| 48 |
+
- [x] **SCN 06** | Completed by Person B (Ayush)
|
| 49 |
+
- [x] **SCN 07** | Completed by Person B (Ayush)
|
| 50 |
+
- [x] **SCN 08** | Completed by Person B (Ayush)
|
| 51 |
+
- [x] **SCN 09** | Completed by Person B (Ayush)
|
| 52 |
+
- [x] **SCN 10** | Completed by Person B (Ayush)
|
docs/kush/task_breakdown.md
CHANGED
|
@@ -8,12 +8,12 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 8 |
|
| 9 |
- `FND 06` is already complete and was executed by `Person B (Ayush)`
|
| 10 |
- The README stub now exists, so the next unblocked documentation task is `DOC 01`
|
| 11 |
-
-
|
|
|
|
| 12 |
|
| 13 |
---
|
| 14 |
|
| 15 |
## Execution order
|
| 16 |
|
| 17 |
-
1. `DOC 01`
|
| 18 |
-
2. `FND 13`
|
| 19 |
-
|
|
|
|
| 8 |
|
| 9 |
- `FND 06` is already complete and was executed by `Person B (Ayush)`
|
| 10 |
- The README stub now exists, so the next unblocked documentation task is `DOC 01`
|
| 11 |
+
- The validated frontend import from Kush's branch is now on `ayush`
|
| 12 |
+
- `FND 13` is now unblocked because `FND 03` is complete
|
| 13 |
|
| 14 |
---
|
| 15 |
|
| 16 |
## Execution order
|
| 17 |
|
| 18 |
+
1. `DOC 01` - improve the README opening now that the temporary stub exists
|
| 19 |
+
2. `FND 13` - reconcile the Tailwind plus base style pipeline with the imported frontend and the current source-of-truth file layout
|
|
|
docs/kush/task_list.md
CHANGED
|
@@ -8,12 +8,13 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 8 |
|
| 9 |
- `FND 06` is complete and was executed by `Person B (Ayush)`
|
| 10 |
- `DOC 01` is now unblocked because `FND 06` is complete
|
| 11 |
-
-
|
|
|
|
| 12 |
|
| 13 |
---
|
| 14 |
|
| 15 |
## Immediate next tasks
|
| 16 |
|
| 17 |
- [ ] **DOC 01** | Write hook, problem statement, and one-line product summary | Depends: `FND 06` | Status: unblocked
|
| 18 |
-
- [ ] **FND 13** | Configure Tailwind and base styling pipeline | Depends: `FND 03` | Status:
|
| 19 |
|
|
|
|
| 8 |
|
| 9 |
- `FND 06` is complete and was executed by `Person B (Ayush)`
|
| 10 |
- `DOC 01` is now unblocked because `FND 06` is complete
|
| 11 |
+
- The frontend shell from Kush's branch is now on `ayush` and builds successfully
|
| 12 |
+
- `FND 13` is now unblocked because `FND 03` is complete
|
| 13 |
|
| 14 |
---
|
| 15 |
|
| 16 |
## Immediate next tasks
|
| 17 |
|
| 18 |
- [ ] **DOC 01** | Write hook, problem statement, and one-line product summary | Depends: `FND 06` | Status: unblocked
|
| 19 |
+
- [ ] **FND 13** | Configure Tailwind and base styling pipeline | Depends: `FND 03` | Status: unblocked
|
| 20 |
|
docs/max/task_breakdown.md
CHANGED
|
@@ -8,24 +8,24 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 8 |
|
| 9 |
- `FND 01`, `FND 02`, `FND 05`, `FND 07`, and `FND 10` are already complete
|
| 10 |
- Those tasks were executed by `Person B (Ayush)` and logged in `docs/changes.md`
|
|
|
|
| 11 |
- `FND 11` is now complete and verified
|
| 12 |
- A normalized backend import from Max's PR is on `ayush`: `server/app.py`, `server/Dockerfile`, and `docs/max/deployment.md`
|
| 13 |
- That backend import is intentionally tracked as partial because it still runs on the stub env and Docker has not yet been validated locally
|
| 14 |
-
- Max's
|
| 15 |
|
| 16 |
---
|
| 17 |
|
| 18 |
## Unblocked now
|
| 19 |
|
| 20 |
-
1.
|
| 21 |
-
2.
|
| 22 |
|
| 23 |
---
|
| 24 |
|
| 25 |
## Still blocked
|
| 26 |
|
| 27 |
-
- `FND
|
| 28 |
-
- `FND 13` depends on `FND 03` even though it is owned by Kush (Person D)
|
| 29 |
- Real completion of `API 01`, `API 02`, `API 03`, `API 06`, and `API 07` depends on Kian's environment tasks
|
| 30 |
- Real completion of `API 08` depends on local Docker build and run validation
|
| 31 |
|
|
@@ -33,9 +33,6 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 33 |
|
| 34 |
## Recommended execution order
|
| 35 |
|
| 36 |
-
1.
|
| 37 |
-
2.
|
| 38 |
-
3.
|
| 39 |
-
4. Validate `server/Dockerfile` locally
|
| 40 |
-
5. Continue into deployment and replay work once the real env path is stable
|
| 41 |
-
|
|
|
|
| 8 |
|
| 9 |
- `FND 01`, `FND 02`, `FND 05`, `FND 07`, and `FND 10` are already complete
|
| 10 |
- Those tasks were executed by `Person B (Ayush)` and logged in `docs/changes.md`
|
| 11 |
+
- `FND 03` and `FND 12` are now complete via the validated frontend import from Kush's branch onto `ayush`
|
| 12 |
- `FND 11` is now complete and verified
|
| 13 |
- A normalized backend import from Max's PR is on `ayush`: `server/app.py`, `server/Dockerfile`, and `docs/max/deployment.md`
|
| 14 |
- That backend import is intentionally tracked as partial because it still runs on the stub env and Docker has not yet been validated locally
|
| 15 |
+
- Max's remaining implementation priority is the real-env-backed API and deployment path
|
| 16 |
|
| 17 |
---
|
| 18 |
|
| 19 |
## Unblocked now
|
| 20 |
|
| 21 |
+
1. Convert the stub-backed API tasks to real-env-backed implementations once Kian lands the environment work
|
| 22 |
+
2. Validate Docker locally once the real env path is in place
|
| 23 |
|
| 24 |
---
|
| 25 |
|
| 26 |
## Still blocked
|
| 27 |
|
| 28 |
+
- `FND 13` is now unblocked because `FND 03` is complete, but it remains owned by Kush (Person D)
|
|
|
|
| 29 |
- Real completion of `API 01`, `API 02`, `API 03`, `API 06`, and `API 07` depends on Kian's environment tasks
|
| 30 |
- Real completion of `API 08` depends on local Docker build and run validation
|
| 31 |
|
|
|
|
| 33 |
|
| 34 |
## Recommended execution order
|
| 35 |
|
| 36 |
+
1. Re-validate the imported server scaffold against Kian's environment implementation
|
| 37 |
+
2. Validate `server/Dockerfile` locally
|
| 38 |
+
3. Continue into deployment and replay work once the real env path is stable
|
|
|
|
|
|
|
|
|
docs/max/task_list.md
CHANGED
|
@@ -8,17 +8,17 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 8 |
|
| 9 |
- `FND 01`, `FND 02`, `FND 05`, `FND 07`, and `FND 10` are complete
|
| 10 |
- All five were executed by `Person B (Ayush)` and recorded as executor deviations
|
|
|
|
| 11 |
- `FND 11` is complete
|
|
|
|
| 12 |
- A stub-backed backend server scaffold now exists in `server/app.py`
|
| 13 |
- `API 01`, `API 02`, `API 03`, `API 04`, `API 06`, `API 07`, `API 08`, `API 13`, `API 14`, and `OBS 02` are partial pending real-env and Docker-level verification
|
| 14 |
-
- The
|
| 15 |
|
| 16 |
---
|
| 17 |
|
| 18 |
## Immediate next tasks
|
| 19 |
|
| 20 |
-
- [ ] **FND 03** | Initialize React plus Vite frontend shell | Depends: `FND 01` | Status: unblocked
|
| 21 |
-
- [ ] **FND 12** | Create `frontend/vite.config.ts` with proxy settings | Depends: `FND 03` | Status: blocked
|
| 22 |
- [ ] **API 01 / API 02 / API 03 / API 06** | Convert the stub-backed server scaffold into real-env-backed endpoints | Depends: `ENV 01`, `ENV 02`, `ENV 06` | Status: partial
|
| 23 |
- [ ] **API 08** | Validate Docker locally for the server image | Depends: `API 01` to `API 07` | Status: partial
|
| 24 |
- [ ] **OBS 02** | Confirm logging behavior against the integrated environment path | Depends: `API 01` | Status: partial
|
|
@@ -29,9 +29,11 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 29 |
|
| 30 |
- [x] **FND 01** | Completed by Person B (Ayush)
|
| 31 |
- [x] **FND 02** | Completed by Person B (Ayush)
|
|
|
|
| 32 |
- [x] **FND 05** | Completed by Person B (Ayush)
|
| 33 |
- [x] **FND 07** | Completed by Person B (Ayush)
|
| 34 |
- [x] **FND 10** | Completed by Person B (Ayush)
|
|
|
|
| 35 |
|
| 36 |
## Completed in Max's lane
|
| 37 |
|
|
|
|
| 8 |
|
| 9 |
- `FND 01`, `FND 02`, `FND 05`, `FND 07`, and `FND 10` are complete
|
| 10 |
- All five were executed by `Person B (Ayush)` and recorded as executor deviations
|
| 11 |
+
- `FND 03` is complete via the validated frontend import from Kush's branch onto `ayush`
|
| 12 |
- `FND 11` is complete
|
| 13 |
+
- `FND 12` is complete via the imported and validated `frontend/vite.config.ts`
|
| 14 |
- A stub-backed backend server scaffold now exists in `server/app.py`
|
| 15 |
- `API 01`, `API 02`, `API 03`, `API 04`, `API 06`, `API 07`, `API 08`, `API 13`, `API 14`, and `OBS 02` are partial pending real-env and Docker-level verification
|
| 16 |
+
- The remaining Max work is now the API, Docker, deployment, replay, and observability path
|
| 17 |
|
| 18 |
---
|
| 19 |
|
| 20 |
## Immediate next tasks
|
| 21 |
|
|
|
|
|
|
|
| 22 |
- [ ] **API 01 / API 02 / API 03 / API 06** | Convert the stub-backed server scaffold into real-env-backed endpoints | Depends: `ENV 01`, `ENV 02`, `ENV 06` | Status: partial
|
| 23 |
- [ ] **API 08** | Validate Docker locally for the server image | Depends: `API 01` to `API 07` | Status: partial
|
| 24 |
- [ ] **OBS 02** | Confirm logging behavior against the integrated environment path | Depends: `API 01` | Status: partial
|
|
|
|
| 29 |
|
| 30 |
- [x] **FND 01** | Completed by Person B (Ayush)
|
| 31 |
- [x] **FND 02** | Completed by Person B (Ayush)
|
| 32 |
+
- [x] **FND 03** | Completed by Kush and imported onto `ayush`
|
| 33 |
- [x] **FND 05** | Completed by Person B (Ayush)
|
| 34 |
- [x] **FND 07** | Completed by Person B (Ayush)
|
| 35 |
- [x] **FND 10** | Completed by Person B (Ayush)
|
| 36 |
+
- [x] **FND 12** | Completed by Kush and imported onto `ayush`
|
| 37 |
|
| 38 |
## Completed in Max's lane
|
| 39 |
|
replicalab/agents/__init__.py
CHANGED
|
@@ -5,6 +5,7 @@ from .lab_manager_policy import (
|
|
| 5 |
FeasibilityCheckResult,
|
| 6 |
SuggestionChange,
|
| 7 |
check_feasibility,
|
|
|
|
| 8 |
suggest_alternative,
|
| 9 |
)
|
| 10 |
from .scientist_policy import (
|
|
@@ -29,6 +30,7 @@ __all__ = [
|
|
| 29 |
"build_scientist_system_prompt",
|
| 30 |
"call_scientist_with_retry",
|
| 31 |
"check_feasibility",
|
|
|
|
| 32 |
"format_scientist_observation",
|
| 33 |
"parse_scientist_output",
|
| 34 |
"suggest_alternative",
|
|
|
|
| 5 |
FeasibilityCheckResult,
|
| 6 |
SuggestionChange,
|
| 7 |
check_feasibility,
|
| 8 |
+
compose_lab_manager_response,
|
| 9 |
suggest_alternative,
|
| 10 |
)
|
| 11 |
from .scientist_policy import (
|
|
|
|
| 30 |
"build_scientist_system_prompt",
|
| 31 |
"call_scientist_with_retry",
|
| 32 |
"check_feasibility",
|
| 33 |
+
"compose_lab_manager_response",
|
| 34 |
"format_scientist_observation",
|
| 35 |
"parse_scientist_output",
|
| 36 |
"suggest_alternative",
|
replicalab/agents/lab_manager_policy.py
CHANGED
|
@@ -5,15 +5,19 @@ normalized scenario pack and returns stable pass/fail status per dimension.
|
|
| 5 |
AGT 06 adds ``suggest_alternative`` which mechanically applies substitution
|
| 6 |
rules, clamps duration, and reduces sample size to produce a concrete
|
| 7 |
revised protocol with a post-fix feasibility recheck.
|
| 8 |
"""
|
| 9 |
|
| 10 |
from __future__ import annotations
|
| 11 |
|
| 12 |
-
from typing import Optional
|
| 13 |
|
| 14 |
from pydantic import BaseModel, ConfigDict, Field, computed_field
|
| 15 |
|
| 16 |
-
from replicalab.models import Protocol
|
| 17 |
from replicalab.scenarios import NormalizedScenarioPack
|
| 18 |
from replicalab.utils.validation import ValidationResult, validate_protocol
|
| 19 |
|
|
@@ -182,6 +186,12 @@ class AlternativeSuggestion(BaseModel):
|
|
| 182 |
post_check: FeasibilityCheckResult
|
| 183 |
|
| 184 |
|
| 185 |
def suggest_alternative(
|
| 186 |
protocol: Protocol,
|
| 187 |
check_result: FeasibilityCheckResult,
|
|
@@ -300,6 +310,55 @@ def suggest_alternative(
|
|
| 300 |
)
|
| 301 |
|
| 302 |
|
| 303 |
def _apply_substitutions(
|
| 304 |
items: list[str],
|
| 305 |
substitution_options: dict[str, list[str]],
|
|
@@ -385,6 +444,125 @@ def _build_tradeoff_index(scenario: NormalizedScenarioPack) -> dict[str, str]:
|
|
| 385 |
return index
|
| 386 |
|
| 387 |
|
| 388 |
def _build_protocol_check(validation_result: ValidationResult) -> DimensionCheck:
|
| 389 |
reasons = [issue.message for issue in validation_result.issues]
|
| 390 |
return DimensionCheck(ok=validation_result.valid, reasons=reasons)
|
|
|
|
| 5 |
AGT 06 adds ``suggest_alternative`` which mechanically applies substitution
|
| 6 |
rules, clamps duration, and reduces sample size to produce a concrete
|
| 7 |
revised protocol with a post-fix feasibility recheck.
|
| 8 |
+
AGT 07 adds ``compose_lab_manager_response`` which converts those grounded
|
| 9 |
+
results into a typed ``LabManagerAction`` with stable flags plus a readable
|
| 10 |
+
explanation. An optional explanation renderer can add richer language later
|
| 11 |
+
without taking over the verdict or constraint fields.
|
| 12 |
"""
|
| 13 |
|
| 14 |
from __future__ import annotations
|
| 15 |
|
| 16 |
+
from typing import Callable, Optional
|
| 17 |
|
| 18 |
from pydantic import BaseModel, ConfigDict, Field, computed_field
|
| 19 |
|
| 20 |
+
from replicalab.models import LabManagerAction, LabManagerActionType, Protocol
|
| 21 |
from replicalab.scenarios import NormalizedScenarioPack
|
| 22 |
from replicalab.utils.validation import ValidationResult, validate_protocol
|
| 23 |
|
|
|
|
| 186 |
post_check: FeasibilityCheckResult
|
| 187 |
|
| 188 |
|
| 189 |
+
ExplanationRenderer = Callable[
|
| 190 |
+
[LabManagerActionType, FeasibilityCheckResult, Optional[AlternativeSuggestion]],
|
| 191 |
+
str,
|
| 192 |
+
]
|
| 193 |
+
|
| 194 |
+
|
| 195 |
def suggest_alternative(
|
| 196 |
protocol: Protocol,
|
| 197 |
check_result: FeasibilityCheckResult,
|
|
|
|
| 310 |
)
|
| 311 |
|
| 312 |
|
| 313 |
+
def compose_lab_manager_response(
|
| 314 |
+
check_result: FeasibilityCheckResult,
|
| 315 |
+
suggestion: Optional[AlternativeSuggestion] = None,
|
| 316 |
+
*,
|
| 317 |
+
explanation_renderer: Optional[ExplanationRenderer] = None,
|
| 318 |
+
) -> LabManagerAction:
|
| 319 |
+
"""Compose a grounded ``LabManagerAction`` from deterministic inputs.
|
| 320 |
+
|
| 321 |
+
The verdict and all feasibility flags remain deterministic. Callers may
|
| 322 |
+
optionally inject an ``explanation_renderer`` for richer wording, but it
|
| 323 |
+
never controls the action type or the pass/fail flags.
|
| 324 |
+
"""
|
| 325 |
+
|
| 326 |
+
action_type = _select_lab_manager_action_type(check_result, suggestion)
|
| 327 |
+
explanation = (
|
| 328 |
+
explanation_renderer(action_type, check_result, suggestion)
|
| 329 |
+
if explanation_renderer is not None
|
| 330 |
+
else _build_default_explanation(action_type, check_result, suggestion)
|
| 331 |
+
).strip()
|
| 332 |
+
if not explanation:
|
| 333 |
+
raise ValueError("Lab Manager explanation must be non-empty")
|
| 334 |
+
|
| 335 |
+
suggested_protocol = (
|
| 336 |
+
suggestion.revised_protocol
|
| 337 |
+
if action_type is LabManagerActionType.SUGGEST_ALTERNATIVE and suggestion is not None
|
| 338 |
+
else None
|
| 339 |
+
)
|
| 340 |
+
|
| 341 |
+
return LabManagerAction(
|
| 342 |
+
action_type=action_type,
|
| 343 |
+
feasible=_lab_constraints_feasible(check_result),
|
| 344 |
+
budget_ok=check_result.budget_ok,
|
| 345 |
+
equipment_ok=check_result.equipment_ok,
|
| 346 |
+
reagents_ok=check_result.reagents_ok,
|
| 347 |
+
schedule_ok=check_result.schedule_ok,
|
| 348 |
+
staff_ok=check_result.staff_ok,
|
| 349 |
+
suggested_technique=(
|
| 350 |
+
suggested_protocol.technique if suggested_protocol is not None else ""
|
| 351 |
+
),
|
| 352 |
+
suggested_sample_size=(
|
| 353 |
+
suggested_protocol.sample_size if suggested_protocol is not None else 0
|
| 354 |
+
),
|
| 355 |
+
suggested_controls=(
|
| 356 |
+
list(suggested_protocol.controls) if suggested_protocol is not None else []
|
| 357 |
+
),
|
| 358 |
+
explanation=explanation,
|
| 359 |
+
)
|
| 360 |
+
|
| 361 |
+
|
| 362 |
def _apply_substitutions(
|
| 363 |
items: list[str],
|
| 364 |
substitution_options: dict[str, list[str]],
|
|
|
|
| 444 |
return index
|
| 445 |
|
| 446 |
|
| 447 |
+
def _select_lab_manager_action_type(
|
| 448 |
+
check_result: FeasibilityCheckResult,
|
| 449 |
+
suggestion: Optional[AlternativeSuggestion],
|
| 450 |
+
) -> LabManagerActionType:
|
| 451 |
+
"""Choose the outward action mode from grounded feasibility results."""
|
| 452 |
+
|
| 453 |
+
lab_constraints_ok = _lab_constraints_feasible(check_result)
|
| 454 |
+
|
| 455 |
+
if lab_constraints_ok and check_result.protocol.ok and check_result.policy.ok:
|
| 456 |
+
return LabManagerActionType.ACCEPT
|
| 457 |
+
|
| 458 |
+
if suggestion is not None and suggestion.applied_changes:
|
| 459 |
+
return LabManagerActionType.SUGGEST_ALTERNATIVE
|
| 460 |
+
|
| 461 |
+
if lab_constraints_ok:
|
| 462 |
+
return LabManagerActionType.REPORT_FEASIBILITY
|
| 463 |
+
|
| 464 |
+
return LabManagerActionType.REJECT
|
| 465 |
+
|
| 466 |
+
|
| 467 |
+
def _build_default_explanation(
|
| 468 |
+
action_type: LabManagerActionType,
|
| 469 |
+
check_result: FeasibilityCheckResult,
|
| 470 |
+
suggestion: Optional[AlternativeSuggestion],
|
| 471 |
+
) -> str:
|
| 472 |
+
"""Render a deterministic human-readable explanation."""
|
| 473 |
+
|
| 474 |
+
if action_type is LabManagerActionType.ACCEPT:
|
| 475 |
+
return f"Accepted. {check_result.summary}"
|
| 476 |
+
|
| 477 |
+
if action_type is LabManagerActionType.SUGGEST_ALTERNATIVE and suggestion is not None:
|
| 478 |
+
parts = [
|
| 479 |
+
"Current proposal is not feasible under the present lab constraints.",
|
| 480 |
+
_format_reason_block(check_result, include_protocol=False, include_policy=False),
|
| 481 |
+
"Suggested revision: "
|
| 482 |
+
+ " ".join(_format_change_sentence(change) for change in suggestion.applied_changes),
|
| 483 |
+
]
|
| 484 |
+
if suggestion.remaining_failures:
|
| 485 |
+
parts.append(
|
| 486 |
+
"Remaining issues after the suggested revision: "
|
| 487 |
+
+ ", ".join(suggestion.remaining_failures)
|
| 488 |
+
+ "."
|
| 489 |
+
)
|
| 490 |
+
return " ".join(part for part in parts if part)
|
| 491 |
+
|
| 492 |
+
if action_type is LabManagerActionType.REPORT_FEASIBILITY:
|
| 493 |
+
parts = [
|
| 494 |
+
"Feasibility report: lab resources and schedule are workable, but the current proposal still needs revision.",
|
| 495 |
+
_format_reason_block(check_result, include_protocol=True, include_policy=True),
|
| 496 |
+
]
|
| 497 |
+
return " ".join(part for part in parts if part)
|
| 498 |
+
|
| 499 |
+
parts = [
|
| 500 |
+
"Rejected. No deterministic revision could satisfy the current lab constraints.",
|
| 501 |
+
_format_reason_block(check_result, include_protocol=False, include_policy=False),
|
| 502 |
+
]
|
| 503 |
+
return " ".join(part for part in parts if part)
|
| 504 |
+
|
| 505 |
+
|
| 506 |
+
def _lab_constraints_feasible(check_result: FeasibilityCheckResult) -> bool:
|
| 507 |
+
return all(
|
| 508 |
+
(
|
| 509 |
+
check_result.budget_ok,
|
| 510 |
+
check_result.equipment_ok,
|
| 511 |
+
check_result.reagents_ok,
|
| 512 |
+
check_result.schedule_ok,
|
| 513 |
+
check_result.staff_ok,
|
| 514 |
+
)
|
| 515 |
+
)
|
| 516 |
+
|
| 517 |
+
|
| 518 |
+
def _format_reason_block(
|
| 519 |
+
check_result: FeasibilityCheckResult,
|
| 520 |
+
*,
|
| 521 |
+
include_protocol: bool,
|
| 522 |
+
include_policy: bool,
|
| 523 |
+
) -> str:
|
| 524 |
+
blocks: list[str] = []
|
| 525 |
+
for name, check in _iter_dimension_checks(
|
| 526 |
+
check_result,
|
| 527 |
+
include_protocol=include_protocol,
|
| 528 |
+
include_policy=include_policy,
|
| 529 |
+
):
|
| 530 |
+
if check.ok or not check.reasons:
|
| 531 |
+
continue
|
| 532 |
+
blocks.append(f"{name}: {' '.join(check.reasons)}")
|
| 533 |
+
return " ".join(blocks)
|
| 534 |
+
|
| 535 |
+
|
| 536 |
+
def _format_change_sentence(change: SuggestionChange) -> str:
|
| 537 |
+
return (
|
| 538 |
+
f"{change.field} changed from {change.original} to {change.revised}. "
|
| 539 |
+
f"{change.reason} Tradeoff: {change.tradeoff}"
|
| 540 |
+
)
|
| 541 |
+
|
| 542 |
+
|
| 543 |
+
def _iter_dimension_checks(
|
| 544 |
+
check_result: FeasibilityCheckResult,
|
| 545 |
+
*,
|
| 546 |
+
include_protocol: bool,
|
| 547 |
+
include_policy: bool,
|
| 548 |
+
) -> list[tuple[str, DimensionCheck]]:
|
| 549 |
+
checks: list[tuple[str, DimensionCheck]] = []
|
| 550 |
+
if include_protocol:
|
| 551 |
+
checks.append(("protocol", check_result.protocol))
|
| 552 |
+
checks.extend(
|
| 553 |
+
[
|
| 554 |
+
("budget", check_result.budget),
|
| 555 |
+
("equipment", check_result.equipment),
|
| 556 |
+
("reagents", check_result.reagents),
|
| 557 |
+
("schedule", check_result.schedule),
|
| 558 |
+
("staff", check_result.staff),
|
| 559 |
+
]
|
| 560 |
+
)
|
| 561 |
+
if include_policy:
|
| 562 |
+
checks.append(("policy", check_result.policy))
|
| 563 |
+
return checks
|
| 564 |
+
|
| 565 |
+
|
| 566 |
def _build_protocol_check(validation_result: ValidationResult) -> DimensionCheck:
|
| 567 |
reasons = [issue.message for issue in validation_result.issues]
|
| 568 |
return DimensionCheck(ok=validation_result.valid, reasons=reasons)
|
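The selection order in `_select_lab_manager_action_type` above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the repository code: `Check` and `pick_action` are hypothetical stand-ins for `FeasibilityCheckResult` and the real helper, the suggestion is reduced to its list of applied changes, and the `"reject"` enum value is assumed (the other three values appear elsewhere in this diff).

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(str, Enum):
    ACCEPT = "accept"
    SUGGEST_ALTERNATIVE = "suggest_alternative"
    REPORT_FEASIBILITY = "report_feasibility"
    REJECT = "reject"  # assumed value; not shown in the diff


@dataclass
class Check:
    """Hypothetical, flattened stand-in for FeasibilityCheckResult."""

    lab_ok: bool  # budget, equipment, reagents, schedule, and staff all pass
    protocol_ok: bool
    policy_ok: bool


def pick_action(check: Check, applied_changes: Optional[list[str]]) -> ActionType:
    # 1. Everything passes: accept.
    if check.lab_ok and check.protocol_ok and check.policy_ok:
        return ActionType.ACCEPT
    # 2. A deterministic revision with at least one change exists: suggest it.
    if applied_changes:
        return ActionType.SUGGEST_ALTERNATIVE
    # 3. Lab constraints pass, so only protocol or policy issues remain: report.
    if check.lab_ok:
        return ActionType.REPORT_FEASIBILITY
    # 4. Lab constraints fail and nothing could be revised: reject.
    return ActionType.REJECT
```

Because the suggestion check comes second, a non-empty revision appears to take priority over a plain feasibility report even when the lab constraints themselves pass.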
replicalab/agents/scientist_policy.py
CHANGED
|
@@ -7,6 +7,8 @@ instead of hard-coded domain text. AGT 02 adds the per-turn observation
|
|
| 7 |
formatter that converts a ``ScientistObservation`` into the user message
|
| 8 |
sent to the LLM each round. AGT 03 wraps the formatter and parser in a
|
| 9 |
retry loop with error-specific correction prompts and exposed telemetry.
|
| 10 |
"""
|
| 11 |
|
| 12 |
from __future__ import annotations
|
|
@@ -30,6 +32,44 @@ from replicalab.scenarios import NormalizedScenarioPack
|
|
| 30 |
log = logging.getLogger(__name__)
|
| 31 |
|
| 32 |
_JSON_FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.IGNORECASE | re.DOTALL)
|
| 33 |
|
| 34 |
|
| 35 |
class ScientistOutputParseError(ValueError):
|
|
@@ -301,6 +341,30 @@ def format_scientist_observation(obs: ScientistObservation) -> str:
|
|
| 301 |
return "\n\n".join(sections)
|
| 302 |
|
| 303 |
|
| 304 |
def _render_history(entries: list[ConversationEntry]) -> str:
|
| 305 |
lines: list[str] = []
|
| 306 |
for entry in entries:
|
|
@@ -510,3 +574,119 @@ def _render_substitutions(pack: NormalizedScenarioPack) -> str:
|
|
| 510 |
)
|
| 511 |
)
|
| 512 |
return "\n".join(lines)
|
| 7 |
formatter that converts a ``ScientistObservation`` into the user message
|
| 8 |
sent to the LLM each round. AGT 03 wraps the formatter and parser in a
|
| 9 |
retry loop with error-specific correction prompts and exposed telemetry.
|
| 10 |
+
AGT 04 adds a deterministic baseline Scientist so smoke tests can run
|
| 11 |
+
without a trained model.
|
| 12 |
"""
|
| 13 |
|
| 14 |
from __future__ import annotations
|
|
|
|
| 32 |
log = logging.getLogger(__name__)
|
| 33 |
|
| 34 |
_JSON_FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.IGNORECASE | re.DOTALL)
|
| 35 |
+
_ML_HINTS = (
|
| 36 |
+
"benchmark",
|
| 37 |
+
"dataset",
|
| 38 |
+
"accuracy",
|
| 39 |
+
"tokenizer",
|
| 40 |
+
"train",
|
| 41 |
+
"gpu",
|
| 42 |
+
"cifar",
|
| 43 |
+
"ag news",
|
| 44 |
+
"bert",
|
| 45 |
+
"resnet",
|
| 46 |
+
)
|
| 47 |
+
_FINANCE_HINTS = (
|
| 48 |
+
"backtest",
|
| 49 |
+
"drawdown",
|
| 50 |
+
"sharpe",
|
| 51 |
+
"trading",
|
| 52 |
+
"slippage",
|
| 53 |
+
"capital",
|
| 54 |
+
"spy",
|
| 55 |
+
"qqq",
|
| 56 |
+
"futures",
|
| 57 |
+
)
|
| 58 |
+
_BLOCKER_HINTS = (
|
| 59 |
+
"booked",
|
| 60 |
+
"unavailable",
|
| 61 |
+
"not available",
|
| 62 |
+
"exceeds",
|
| 63 |
+
"tight",
|
| 64 |
+
"limited",
|
| 65 |
+
"deadline",
|
| 66 |
+
"budget",
|
| 67 |
+
"cost",
|
| 68 |
+
"drawdown",
|
| 69 |
+
"slippage",
|
| 70 |
+
"risk",
|
| 71 |
+
"conflict",
|
| 72 |
+
)
|
| 73 |
|
| 74 |
|
| 75 |
class ScientistOutputParseError(ValueError):
|
|
|
|
| 341 |
return "\n\n".join(sections)
|
| 342 |
|
| 343 |
|
| 344 |
+
def build_baseline_scientist_action(
|
| 345 |
+
observation: ScientistObservation,
|
| 346 |
+
) -> ScientistAction:
|
| 347 |
+
"""Return a deterministic non-LLM Scientist action for smoke tests.
|
| 348 |
+
|
| 349 |
+
The baseline follows a conservative policy:
|
| 350 |
+
- propose a valid protocol when no protocol exists yet
|
| 351 |
+
- revise the current protocol if the latest Lab Manager message contains
|
| 352 |
+
an obvious feasibility blocker
|
| 353 |
+
- otherwise accept the current protocol to complete the episode cleanly
|
| 354 |
+
"""
|
| 355 |
+
|
| 356 |
+
latest_feedback = _latest_lab_manager_feedback(observation)
|
| 357 |
+
|
| 358 |
+
if observation.current_protocol is not None:
|
| 359 |
+
if observation.round_number >= max(1, observation.max_rounds - 1):
|
| 360 |
+
return _build_accept_action()
|
| 361 |
+
if latest_feedback and _feedback_requires_revision(latest_feedback.message):
|
| 362 |
+
return _build_revision_action(observation.current_protocol, latest_feedback)
|
| 363 |
+
return _build_accept_action()
|
| 364 |
+
|
| 365 |
+
return _build_initial_protocol_action(observation)
|
| 366 |
+
|
| 367 |
+
|
| 368 |
def _render_history(entries: list[ConversationEntry]) -> str:
|
| 369 |
lines: list[str] = []
|
| 370 |
for entry in entries:
|
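The branch order in `build_baseline_scientist_action` can be restated as a small pure function. The sketch below is illustrative only: `baseline_decision` and its flat parameters are hypothetical, standing in for the fields read from `ScientistObservation`, and the returned strings mirror the action type values used in `server/app.py`.

```python
from __future__ import annotations

from typing import Optional, Sequence


def baseline_decision(
    has_protocol: bool,
    round_number: int,
    max_rounds: int,
    latest_feedback: Optional[str],
    blocker_hints: Sequence[str],
) -> str:
    # No protocol yet: the only sensible move is to propose one.
    if not has_protocol:
        return "propose_protocol"
    # Near the round limit: accept so the episode ends cleanly.
    if round_number >= max(1, max_rounds - 1):
        return "accept"
    # Lab Manager feedback mentions a blocker keyword: revise.
    if latest_feedback and any(
        hint in latest_feedback.lower() for hint in blocker_hints
    ):
        return "revise_protocol"
    # Otherwise accept the current protocol.
    return "accept"
```

Note that the round-limit check runs before the blocker check, so a blocker reported in the penultimate round still results in an accept.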
|
|
|
| 574 |
)
|
| 575 |
)
|
| 576 |
return "\n".join(lines)
|
| 577 |
+
|
| 578 |
+
|
| 579 |
+
def _build_accept_action() -> ScientistAction:
|
| 580 |
+
return ScientistAction(
|
| 581 |
+
action_type=ScientistActionType.ACCEPT,
|
| 582 |
+
sample_size=0,
|
| 583 |
+
controls=[],
|
| 584 |
+
technique="",
|
| 585 |
+
duration_days=0,
|
| 586 |
+
required_equipment=[],
|
| 587 |
+
required_reagents=[],
|
| 588 |
+
questions=[],
|
| 589 |
+
rationale="",
|
| 590 |
+
)
|
| 591 |
+
|
| 592 |
+
|
| 593 |
+
def _build_initial_protocol_action(
|
| 594 |
+
observation: ScientistObservation,
|
| 595 |
+
) -> ScientistAction:
|
| 596 |
+
domain = _infer_domain(observation)
|
| 597 |
+
defaults = _baseline_defaults_for_domain(domain)
|
| 598 |
+
|
| 599 |
+
return ScientistAction(
|
| 600 |
+
action_type=ScientistActionType.PROPOSE_PROTOCOL,
|
| 601 |
+
sample_size=defaults["sample_size"],
|
| 602 |
+
controls=list(defaults["controls"]),
|
| 603 |
+
technique=defaults["technique"],
|
| 604 |
+
duration_days=defaults["duration_days"],
|
| 605 |
+
required_equipment=[],
|
| 606 |
+
required_reagents=[],
|
| 607 |
+
questions=[],
|
| 608 |
+
rationale=(
|
| 609 |
+
f"Baseline proposal for {observation.paper_title}: "
|
| 610 |
+
f"use a concise {defaults['technique']} plan aligned to the stated goal "
|
| 611 |
+
f"'{observation.experiment_goal}'."
|
| 612 |
+
),
|
| 613 |
+
)
|
| 614 |
+
|
| 615 |
+
|
| 616 |
+
def _build_revision_action(
|
| 617 |
+
protocol: Protocol,
|
| 618 |
+
feedback: ConversationEntry,
|
| 619 |
+
) -> ScientistAction:
|
| 620 |
+
reduced_sample_size = max(1, protocol.sample_size // 2) if protocol.sample_size else 1
|
| 621 |
+
reduced_duration = max(1, protocol.duration_days - 1) if protocol.duration_days else 1
|
| 622 |
+
revised_controls = list(protocol.controls) or ["fallback_review"]
|
| 623 |
+
|
| 624 |
+
return ScientistAction(
|
| 625 |
+
action_type=ScientistActionType.REVISE_PROTOCOL,
|
| 626 |
+
sample_size=reduced_sample_size,
|
| 627 |
+
controls=revised_controls,
|
| 628 |
+
technique=protocol.technique,
|
| 629 |
+
duration_days=reduced_duration,
|
| 630 |
+
required_equipment=list(protocol.required_equipment),
|
| 631 |
+
required_reagents=list(protocol.required_reagents),
|
| 632 |
+
questions=[],
|
| 633 |
+
rationale=(
|
| 634 |
+
"Baseline revision reduces scope to address the latest Lab Manager "
|
| 635 |
+
f"concern: {feedback.message}"
|
| 636 |
+
),
|
| 637 |
+
)
|
| 638 |
+
|
| 639 |
+
|
| 640 |
+
def _latest_lab_manager_feedback(
|
| 641 |
+
observation: ScientistObservation,
|
| 642 |
+
) -> ConversationEntry | None:
|
| 643 |
+
for entry in reversed(observation.conversation_history):
|
| 644 |
+
if entry.role == "lab_manager":
|
| 645 |
+
return entry
|
| 646 |
+
return None
|
| 647 |
+
|
| 648 |
+
|
| 649 |
+
def _feedback_requires_revision(message: str) -> bool:
|
| 650 |
+
lowered = message.lower()
|
| 651 |
+
return any(token in lowered for token in _BLOCKER_HINTS)
|
| 652 |
+
|
| 653 |
+
|
| 654 |
+
def _infer_domain(observation: ScientistObservation) -> str:
|
| 655 |
+
haystack = " ".join(
|
| 656 |
+
[
|
| 657 |
+
observation.paper_title,
|
| 658 |
+
observation.paper_hypothesis,
|
| 659 |
+
observation.paper_method,
|
| 660 |
+
observation.paper_key_finding,
|
| 661 |
+
observation.experiment_goal,
|
| 662 |
+
]
|
| 663 |
+
).lower()
|
| 664 |
+
|
| 665 |
+
if any(token in haystack for token in _ML_HINTS):
|
| 666 |
+
return "machine_learning"
|
| 667 |
+
if any(token in haystack for token in _FINANCE_HINTS):
|
| 668 |
+
return "finance_trading"
|
| 669 |
+
return "mathematics"
|
| 670 |
+
|
| 671 |
+
|
| 672 |
+
def _baseline_defaults_for_domain(domain: str) -> dict[str, Any]:
|
| 673 |
+
if domain == "machine_learning":
|
| 674 |
+
return {
|
| 675 |
+
"sample_size": 8,
|
| 676 |
+
"controls": ["published_split_check", "heldout_evaluation"],
|
| 677 |
+
"technique": "published_split_replication",
|
| 678 |
+
"duration_days": 2,
|
| 679 |
+
}
|
| 680 |
+
if domain == "finance_trading":
|
| 681 |
+
return {
|
| 682 |
+
"sample_size": 12,
|
| 683 |
+
"controls": ["drawdown_guardrail", "offline_evaluation_split"],
|
| 684 |
+
"technique": "offline_backtest_workflow",
|
| 685 |
+
"duration_days": 2,
|
| 686 |
+
}
|
| 687 |
+
return {
|
| 688 |
+
"sample_size": 4,
|
| 689 |
+
"controls": ["equality_case_check", "final_verification_pass"],
|
| 690 |
+
"technique": "structured_proof_outline",
|
| 691 |
+
"duration_days": 1,
|
| 692 |
+
}
|
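The keyword-based domain inference in `_infer_domain` relies on plain substring matching, which is worth seeing on a few inputs. The sketch below copies the two hint tuples from the diff; `infer_domain` taking a single string is a simplification (the real helper joins several observation fields), and the sample titles are made up for illustration.

```python
ML_HINTS = (
    "benchmark", "dataset", "accuracy", "tokenizer", "train",
    "gpu", "cifar", "ag news", "bert", "resnet",
)
FINANCE_HINTS = (
    "backtest", "drawdown", "sharpe", "trading", "slippage",
    "capital", "spy", "qqq", "futures",
)


def infer_domain(text: str) -> str:
    haystack = text.lower()
    # ML hints are checked first, so they win when both sets match.
    if any(token in haystack for token in ML_HINTS):
        return "machine_learning"
    if any(token in haystack for token in FINANCE_HINTS):
        return "finance_trading"
    # Nothing matched: fall back to mathematics.
    return "mathematics"


print(infer_domain("Replicating ResNet accuracy on CIFAR-10"))
print(infer_domain("Sharpe ratio of a momentum backtest"))
print(infer_domain("A proof of the AM-GM inequality"))
# Substring matching caveat: "constraint" contains "train".
print(infer_domain("Optimization under a norm constraint"))
```

The last call returns `machine_learning` because `"train"` occurs inside `"constraint"`. Whether this matters for the shipped scenarios is not something the diff shows; matching on word boundaries would be one way to tighten it if it does.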
server/app.py
CHANGED
|
@@ -31,6 +31,11 @@ from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect
|
|
| 31 |
from fastapi.middleware.cors import CORSMiddleware
|
| 32 |
from pydantic import BaseModel
|
| 33 |
|
| 34 |
from replicalab.config import (
|
| 35 |
API_HOST,
|
| 36 |
API_PORT,
|
|
@@ -40,11 +45,16 @@ from replicalab.config import (
|
|
| 40 |
STUB_ACCEPT_REWARD,
|
| 41 |
WS_IDLE_TIMEOUT_SECONDS,
|
| 42 |
)
|
| 43 |
-
from replicalab.scenarios import
|
| 44 |
from replicalab.models import (
|
| 45 |
ConversationEntry,
|
| 46 |
EpisodeLog,
|
| 47 |
EpisodeState,
|
| 48 |
LabManagerObservation,
|
| 49 |
Observation,
|
| 50 |
Protocol,
|
|
@@ -121,6 +131,7 @@ class _StubEnv:
|
|
| 121 |
self._state = EpisodeState()
|
| 122 |
self._logs: list[ConversationEntry] = []
|
| 123 |
self._episode_id: str = ""
|
| 124 |
|
| 125 |
# ── public interface (matches ReplicaLabEnv) ──────────────────────────
|
| 126 |
|
|
@@ -133,6 +144,7 @@ class _StubEnv:
|
|
| 133 |
self._episode_id = str(uuid.uuid4())
|
| 134 |
self._logs = []
|
| 135 |
pack = generate_scenario(seed=seed, template=scenario, difficulty=difficulty)
|
| 136 |
self._state = EpisodeState(
|
| 137 |
seed=seed,
|
| 138 |
scenario_template=scenario,
|
|
@@ -160,10 +172,12 @@ class _StubEnv:
|
|
| 160 |
|
| 161 |
def step(self, action: ScientistAction) -> StepResult:
|
| 162 |
self._state.round_number += 1
|
| 163 |
self._logs.append(self._scientist_log_entry(action))
|
| 164 |
-
self.
| 165 |
self._state.conversation_history = list(self._logs)
|
| 166 |
-
self._state.current_protocol =
|
| 167 |
done = (
|
| 168 |
action.action_type == "accept"
|
| 169 |
or self._state.round_number >= self._state.max_rounds
|
|
@@ -218,20 +232,39 @@ class _StubEnv:
|
|
| 218 |
action_type=action_type,
|
| 219 |
)
|
| 220 |
|
| 221 |
-
def _lab_manager_log_entry(self, action:
|
| 222 |
-
|
| 223 |
-
|
| 224 |
-
action_type
|
| 225 |
-
|
| 226 |
-
|
| 227 |
-
action_type = "report_feasibility"
|
| 228 |
return ConversationEntry(
|
| 229 |
role="lab_manager",
|
| 230 |
-
message=
|
| 231 |
round_number=self._state.round_number,
|
| 232 |
action_type=action_type,
|
| 233 |
)
|
| 234 |
|
| 235 |
def _protocol_from_action(self, action: ScientistAction) -> Optional[Protocol]:
|
| 236 |
if action.action_type not in {"propose_protocol", "revise_protocol"}:
|
| 237 |
return self._state.current_protocol
|
|
|
|
| 31 |
from fastapi.middleware.cors import CORSMiddleware
|
| 32 |
from pydantic import BaseModel
|
| 33 |
|
| 34 |
+
from replicalab.agents import (
|
| 35 |
+
check_feasibility,
|
| 36 |
+
compose_lab_manager_response,
|
| 37 |
+
suggest_alternative,
|
| 38 |
+
)
|
| 39 |
from replicalab.config import (
|
| 40 |
API_HOST,
|
| 41 |
API_PORT,
|
|
|
|
| 45 |
STUB_ACCEPT_REWARD,
|
| 46 |
WS_IDLE_TIMEOUT_SECONDS,
|
| 47 |
)
|
| 48 |
+
from replicalab.scenarios import (
|
| 49 |
+
NormalizedScenarioPack,
|
| 50 |
+
available_scenario_families,
|
| 51 |
+
generate_scenario,
|
| 52 |
+
)
|
| 53 |
from replicalab.models import (
|
| 54 |
ConversationEntry,
|
| 55 |
EpisodeLog,
|
| 56 |
EpisodeState,
|
| 57 |
+
LabManagerAction,
|
| 58 |
LabManagerObservation,
|
| 59 |
Observation,
|
| 60 |
Protocol,
|
|
|
|
| 131 |
self._state = EpisodeState()
|
| 132 |
self._logs: list[ConversationEntry] = []
|
| 133 |
self._episode_id: str = ""
|
| 134 |
+
self._scenario_pack: Optional[NormalizedScenarioPack] = None
|
| 135 |
|
| 136 |
# ── public interface (matches ReplicaLabEnv) ──────────────────────────
|
| 137 |
|
|
|
|
| 144 |
self._episode_id = str(uuid.uuid4())
|
| 145 |
self._logs = []
|
| 146 |
pack = generate_scenario(seed=seed, template=scenario, difficulty=difficulty)
|
| 147 |
+
self._scenario_pack = pack
|
| 148 |
self._state = EpisodeState(
|
| 149 |
seed=seed,
|
| 150 |
scenario_template=scenario,
|
|
|
|
| 172 |
|
| 173 |
def step(self, action: ScientistAction) -> StepResult:
|
| 174 |
self._state.round_number += 1
|
| 175 |
+
proposed_protocol = self._protocol_from_action(action)
|
| 176 |
self._logs.append(self._scientist_log_entry(action))
|
| 177 |
+
lab_manager_action = self._lab_manager_action(proposed_protocol)
|
| 178 |
+
self._logs.append(self._lab_manager_log_entry(lab_manager_action))
|
| 179 |
self._state.conversation_history = list(self._logs)
|
| 180 |
+
self._state.current_protocol = proposed_protocol
|
| 181 |
done = (
|
| 182 |
action.action_type == "accept"
|
| 183 |
or self._state.round_number >= self._state.max_rounds
|
|
|
|
| 232 |
action_type=action_type,
|
| 233 |
)
|
| 234 |
|
| 235 |
+
def _lab_manager_log_entry(self, action: LabManagerAction) -> ConversationEntry:
|
| 236 |
+
action_type = (
|
| 237 |
+
action.action_type.value
|
| 238 |
+
if hasattr(action.action_type, "value")
|
| 239 |
+
else str(action.action_type)
|
| 240 |
+
)
|
| 241 |
return ConversationEntry(
|
| 242 |
role="lab_manager",
|
| 243 |
+
message=action.explanation,
|
| 244 |
round_number=self._state.round_number,
|
| 245 |
action_type=action_type,
|
| 246 |
)
|
| 247 |
|
| 248 |
+
def _lab_manager_action(self, protocol: Optional[Protocol]) -> LabManagerAction:
|
| 249 |
+
if protocol is None or self._scenario_pack is None:
|
| 250 |
+
return LabManagerAction(
|
| 251 |
+
action_type="report_feasibility",
|
| 252 |
+
feasible=True,
|
| 253 |
+
budget_ok=True,
|
| 254 |
+
equipment_ok=True,
|
| 255 |
+
reagents_ok=True,
|
| 256 |
+
schedule_ok=True,
|
| 257 |
+
staff_ok=True,
|
| 258 |
+
suggested_technique="",
|
| 259 |
+
suggested_sample_size=0,
|
| 260 |
+
suggested_controls=[],
|
| 261 |
+
explanation="No concrete protocol is available to review yet.",
|
| 262 |
+
)
|
| 263 |
+
|
| 264 |
+
check_result = check_feasibility(protocol, self._scenario_pack)
|
| 265 |
+
suggestion = suggest_alternative(protocol, check_result, self._scenario_pack)
|
| 266 |
+
return compose_lab_manager_response(check_result, suggestion)
|
| 267 |
+
|
| 268 |
def _protocol_from_action(self, action: ScientistAction) -> Optional[Protocol]:
|
| 269 |
if action.action_type not in {"propose_protocol", "revise_protocol"}:
|
| 270 |
return self._state.current_protocol
|
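`_lab_manager_log_entry` accepts an `action_type` that may be either an enum member (from `compose_lab_manager_response`) or a plain string (from the fallback branch in `_lab_manager_action`, depending on how the model coerces it). A minimal sketch of that normalization follows; `DemoActionType` and `action_type_label` are hypothetical names used only for this illustration.

```python
from enum import Enum


class DemoActionType(Enum):
    """Hypothetical local enum; the real one lives in replicalab.models."""

    REPORT_FEASIBILITY = "report_feasibility"


def action_type_label(action_type: object) -> str:
    # Enum members expose the wire value via .value; plain strings pass through.
    if hasattr(action_type, "value"):
        return str(action_type.value)
    return str(action_type)


print(action_type_label(DemoActionType.REPORT_FEASIBILITY))
print(action_type_label("report_feasibility"))
```

Reading `.value` avoids depending on how `str()` renders an enum member, so both inputs produce the same log label.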
tests/test_lab_manager_policy.py
CHANGED
|
@@ -3,9 +3,10 @@ from __future__ import annotations
|
|
| 3 |
from replicalab.agents.lab_manager_policy import (
|
| 4 |
AlternativeSuggestion,
|
| 5 |
check_feasibility,
|
| 6 |
suggest_alternative,
|
| 7 |
)
|
| 8 |
-
from replicalab.models import Protocol
|
| 9 |
from replicalab.scenarios import generate_scenario
|
| 10 |
|
| 11 |
|
|
@@ -316,3 +317,94 @@ def test_suggest_alternative_reports_remaining_failures() -> None:
|
|
| 316 |
result = suggest_alternative(protocol, check, scenario)
|
| 317 |
assert result is not None
|
| 318 |
assert "policy" in result.remaining_failures
|
| 3 |
from replicalab.agents.lab_manager_policy import (
|
| 4 |
AlternativeSuggestion,
|
| 5 |
check_feasibility,
|
| 6 |
+
compose_lab_manager_response,
|
| 7 |
suggest_alternative,
|
| 8 |
)
|
| 9 |
+
from replicalab.models import LabManagerActionType, Protocol
|
| 10 |
from replicalab.scenarios import generate_scenario
|
| 11 |
|
| 12 |
|
|
|
|
| 317 |
result = suggest_alternative(protocol, check, scenario)
|
| 318 |
assert result is not None
|
| 319 |
assert "policy" in result.remaining_failures
|
| 320 |
+
|
| 321 |
+
|
| 322 |
+
# ---------------------------------------------------------------------------
|
| 323 |
+
# AGT 07 - compose_lab_manager_response
|
| 324 |
+
# ---------------------------------------------------------------------------
|
| 325 |
+
|
| 326 |
+
|
| 327 |
+
def test_compose_lab_manager_response_accepts_feasible_protocol() -> None:
|
| 328 |
+
scenario = _scenario("ml_benchmark", "easy")
|
| 329 |
+
protocol = _protocol_for_scenario(scenario)
|
| 330 |
+
check = check_feasibility(protocol, scenario)
|
| 331 |
+
|
| 332 |
+
action = compose_lab_manager_response(check)
|
| 333 |
+
|
| 334 |
+
assert action.action_type is LabManagerActionType.ACCEPT
|
| 335 |
+
assert action.feasible is True
|
| 336 |
+
assert action.suggested_technique == ""
|
| 337 |
+
assert "Accepted." in action.explanation
|
| 338 |
+
|
| 339 |
+
|
| 340 |
+
def test_compose_lab_manager_response_suggests_alternative_when_revision_exists() -> None:
|
| 341 |
+
scenario = _scenario("ml_benchmark", "easy")
|
| 342 |
+
protocol = _protocol_for_scenario(
|
| 343 |
+
scenario,
|
| 344 |
+
sample_size=200,
|
| 345 |
+
duration_days=scenario.lab_manager_observation.time_limit_days,
|
| 346 |
+
controls=["baseline", "ablation", "sanity_check"],
|
| 347 |
+
required_equipment=list(scenario.lab_manager_observation.equipment_available),
|
| 348 |
+
required_reagents=list(scenario.lab_manager_observation.reagents_in_stock),
|
| 349 |
+
)
|
| 350 |
+
check = check_feasibility(protocol, scenario)
|
| 351 |
+
suggestion = suggest_alternative(protocol, check, scenario)
|
| 352 |
+
|
| 353 |
+
assert suggestion is not None
|
| 354 |
+
action = compose_lab_manager_response(check, suggestion)
|
| 355 |
+
|
| 356 |
+
assert action.action_type is LabManagerActionType.SUGGEST_ALTERNATIVE
|
| 357 |
+
assert action.feasible is False
|
| 358 |
+
assert action.suggested_sample_size == suggestion.revised_protocol.sample_size
|
| 359 |
+
assert action.suggested_controls == suggestion.revised_protocol.controls
|
| 360 |
+
assert "Suggested revision:" in action.explanation
|
| 361 |
+
|
| 362 |
+
|
| 363 |
+
def test_compose_lab_manager_response_rejects_when_no_revision_exists() -> None:
|
| 364 |
+
scenario = _scenario("ml_benchmark", "easy")
|
| 365 |
+
protocol = _protocol_for_scenario(
|
| 366 |
+
scenario,
|
| 367 |
+
required_equipment=["Imaginary GPU Rack"],
|
| 368 |
+
)
|
| 369 |
+
check = check_feasibility(protocol, scenario)
|
| 370 |
+
suggestion = suggest_alternative(protocol, check, scenario)
|
| 371 |
+
|
| 372 |
+
action = compose_lab_manager_response(check, suggestion)
|
| 373 |
+
|
| 374 |
+
assert action.action_type is LabManagerActionType.REJECT
|
| 375 |
+
assert action.feasible is False
|
| 376 |
+
assert "No deterministic revision could satisfy" in action.explanation
|
| 377 |
+
|
| 378 |
+
|
| 379 |
+
def test_compose_lab_manager_response_reports_non_lab_issues() -> None:
|
| 380 |
+
scenario = _scenario("finance_trading", "easy")
|
| 381 |
+
protocol = _protocol_for_scenario(
|
| 382 |
+
scenario,
|
| 383 |
+
technique="live trading execution plan",
|
| 384 |
+
rationale="Use live trading once the backtest looks strong.",
|
| 385 |
+
)
|
| 386 |
+
check = check_feasibility(protocol, scenario)
|
| 387 |
+
suggestion = suggest_alternative(protocol, check, scenario)
|
| 388 |
+
|
| 389 |
+
action = compose_lab_manager_response(check, suggestion)
|
| 390 |
+
|
| 391 |
+
assert action.action_type is LabManagerActionType.REPORT_FEASIBILITY
|
| 392 |
+
assert action.feasible is True
|
| 393 |
+
assert "policy" in action.explanation.lower()
|
| 394 |
+
|
| 395 |
+
|
| 396 |
+
def test_compose_lab_manager_response_uses_custom_renderer_without_changing_verdict() -> None:
|
| 397 |
+
scenario = _scenario("ml_benchmark", "easy")
|
| 398 |
+
protocol = _protocol_for_scenario(scenario)
|
| 399 |
+
check = check_feasibility(protocol, scenario)
|
| 400 |
+
|
| 401 |
+
action = compose_lab_manager_response(
|
| 402 |
+
check,
|
| 403 |
+
explanation_renderer=lambda action_type, result, suggestion: (
|
| 404 |
+
f"Renderer saw {action_type.value} with feasible={result.feasible}."
|
| 405 |
+
),
|
| 406 |
+
)
|
| 407 |
+
|
| 408 |
+
assert action.action_type is LabManagerActionType.ACCEPT
|
| 409 |
+
assert action.feasible is True
|
| 410 |
+
assert action.explanation == "Renderer saw accept with feasible=True."
|
tests/test_scientist_policy.py
CHANGED
|
@@ -6,6 +6,7 @@ from replicalab.agents.scientist_policy import (
|
|
| 6 |
RetryMetadata,
|
| 7 |
ScientistCallResult,
|
| 8 |
ScientistOutputParseError,
|
| 9 |
build_scientist_system_prompt,
|
| 10 |
call_scientist_with_retry,
|
| 11 |
format_scientist_observation,
|
|
@@ -456,3 +457,102 @@ def test_retry_metadata_serializable() -> None:
|
|
| 456 |
restored = RetryMetadata.model_validate_json(dumped)
|
| 457 |
assert restored.attempt_count == 1
|
| 458 |
assert restored.retry_count == 0
|
| 6 |
RetryMetadata,
|
| 7 |
ScientistCallResult,
|
| 8 |
ScientistOutputParseError,
|
| 9 |
+
build_baseline_scientist_action,
|
| 10 |
build_scientist_system_prompt,
|
| 11 |
call_scientist_with_retry,
|
| 12 |
format_scientist_observation,
|
|
|
|
| 457 |
restored = RetryMetadata.model_validate_json(dumped)
|
| 458 |
assert restored.attempt_count == 1
|
| 459 |
assert restored.retry_count == 0
|
| 460 |
+
|
| 461 |
+
|
| 462 |
+
# ---------------------------------------------------------------------------
|
| 463 |
+
# AGT 04 - build_baseline_scientist_action
|
| 464 |
+
# ---------------------------------------------------------------------------
|
| 465 |
+
|
| 466 |
+
|
| 467 |
+
def test_baseline_scientist_proposes_protocol_for_fresh_observation() -> None:
|
| 468 |
+
action = build_baseline_scientist_action(_base_observation())
|
| 469 |
+
|
| 470 |
+
assert action.action_type is ScientistActionType.PROPOSE_PROTOCOL
|
| 471 |
+
assert action.sample_size >= 1
|
| 472 |
+
assert action.duration_days >= 1
|
| 473 |
+
assert action.questions == []
|
| 474 |
+
assert action.rationale
|
| 475 |
+
|
| 476 |
+
|
| 477 |
+
def test_baseline_scientist_accepts_existing_protocol_without_blocker() -> None:
|
| 478 |
+
obs = _base_observation(
|
| 479 |
+
current_protocol=Protocol(
|
| 480 |
+
sample_size=10,
|
| 481 |
+
controls=["baseline_check"],
|
| 482 |
+
technique="published_split_replication",
|
| 483 |
+
duration_days=2,
|
| 484 |
+
required_equipment=[],
|
| 485 |
+
required_reagents=[],
|
| 486 |
+
rationale="Initial protocol is already in place.",
|
| 487 |
+
),
|
| 488 |
+
conversation_history=[
|
| 489 |
+
ConversationEntry(
|
| 490 |
+
role="lab_manager",
|
| 491 |
+
message="The current plan remains feasible.",
|
| 492 |
+
round_number=1,
|
| 493 |
+
action_type="report_feasibility",
|
| 494 |
+
)
|
| 495 |
+
],
|
| 496 |
+
round_number=1,
|
| 497 |
+
)
|
| 498 |
+
|
| 499 |
+
action = build_baseline_scientist_action(obs)
|
| 500 |
+
|
| 501 |
+
assert action.action_type is ScientistActionType.ACCEPT
|
| 502 |
+
assert action.sample_size == 0
|
| 503 |
+
assert action.controls == []
|
| 504 |
+
|
| 505 |
+
|
| 506 |
+
def test_baseline_scientist_revises_when_latest_feedback_has_blocker() -> None:
|
| 507 |
+
obs = _base_observation(
|
| 508 |
+
current_protocol=Protocol(
|
| 509 |
+
sample_size=12,
|
| 510 |
+
controls=["published_split_check", "heldout_evaluation"],
|
| 511 |
+
technique="published_split_replication",
|
| 512 |
+
duration_days=3,
|
| 513 |
+
required_equipment=[],
|
| 514 |
+
required_reagents=[],
|
| 515 |
+
rationale="Original scope is full-size.",
|
| 516 |
+
),
|
| 517 |
+
conversation_history=[
|
| 518 |
+
ConversationEntry(
|
| 519 |
+
role="lab_manager",
|
| 520 |
+
message="The current GPU plan is booked, so the schedule is too tight.",
|
| 521 |
+
round_number=1,
|
| 522 |
+
action_type="suggest_alternative",
|
| 523 |
+
)
|
| 524 |
+
],
|
| 525 |
+
round_number=1,
|
| 526 |
+
)
|
| 527 |
+
|
| 528 |
+
action = build_baseline_scientist_action(obs)
|
| 529 |
+
|
| 530 |
+
assert action.action_type is ScientistActionType.REVISE_PROTOCOL
|
| 531 |
+
assert action.sample_size == 6
|
| 532 |
+
assert action.duration_days == 2
|
| 533 |
+
assert "latest Lab Manager concern" in action.rationale
|
| 534 |
+
|
| 535 |
+
|
| 536 |
+
def test_baseline_scientist_finishes_stub_episode_without_crashing() -> None:
|
| 537 |
+
from server.app import _StubEnv
|
| 538 |
+
|
| 539 |
+
env = _StubEnv()
|
| 540 |
+
|
| 541 |
+
first_observation = env.reset(
|
| 542 |
+
seed=14,
|
| 543 |
+
scenario="ml_benchmark",
|
| 544 |
+
difficulty="easy",
|
| 545 |
+
).scientist
|
| 546 |
+
assert first_observation is not None
|
| 547 |
+
|
| 548 |
+
first_action = build_baseline_scientist_action(first_observation)
|
| 549 |
+
first_step = env.step(first_action)
|
| 550 |
+
assert first_step.done is False
|
| 551 |
+
assert first_step.observation is not None
|
| 552 |
+
assert first_step.observation.scientist is not None
|
| 553 |
+
|
| 554 |
+
second_action = build_baseline_scientist_action(first_step.observation.scientist)
|
| 555 |
+
second_step = env.step(second_action)
|
| 556 |
+
|
| 557 |
+
assert second_step.done is True
|
| 558 |
+
assert second_step.info.agreement_reached is True
|
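The values asserted in `test_baseline_scientist_revises_when_latest_feedback_has_blocker` (sample size 12 to 6, duration 3 to 2) follow from the scope reduction in `_build_revision_action`. The arithmetic can be checked on its own; `reduce_scope` is a hypothetical helper that copies the two expressions from the diff.

```python
def reduce_scope(sample_size: int, duration_days: int) -> tuple[int, int]:
    # Halve the sample size, never going below 1; a zero input maps to 1.
    reduced_sample_size = max(1, sample_size // 2) if sample_size else 1
    # Shorten the duration by one day, never going below 1.
    reduced_duration = max(1, duration_days - 1) if duration_days else 1
    return reduced_sample_size, reduced_duration


print(reduce_scope(12, 3))  # the case covered by the test above
print(reduce_scope(1, 1))   # already minimal: stays at the floor
print(reduce_scope(0, 0))   # unset values fall back to 1
```

One consequence is that a protocol already at a sample size of 1 and a duration of 1 day is returned unchanged, so repeated revisions eventually stop shrinking the plan.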