ayushozha Claude Opus 4.6 committed on
Commit 7c2246c · 1 Parent(s): 58a2a96

Add AGT 04/05/07 implementations, server integration, and doc updates


- AGT 04: build_baseline_scientist_action with domain inference
- AGT 05: check_feasibility with per-dimension checks
- AGT 07: compose_lab_manager_response with action type selection
- Server: real Lab Manager pipeline replaces hardcoded stub responses
- Tests: AGT 04/05/07 test coverage
- Docs: task completion tracking and model selection notes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
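
The per-dimension feasibility check named in the AGT 05 bullet could look roughly like the sketch below. This is not the code from this commit: the function name `check_feasibility_sketch`, the `DimensionCheck` type, and every dictionary key are hypothetical. Only the five dimensions (budget, equipment, reagent, schedule, staff) come from the task notes.

```python
from dataclasses import dataclass

# Dimensions listed in the MOD 02 notes; everything else below is assumed.
DIMENSIONS = ("budget", "equipment", "reagent", "schedule", "staff")


@dataclass(frozen=True)
class DimensionCheck:
    ok: bool
    reason: str


def check_feasibility_sketch(protocol: dict, resources: dict) -> dict:
    """Return one pass/fail result per constraint dimension (hypothetical keys)."""
    missing_equipment = sorted(set(protocol["equipment"]) - set(resources["equipment"]))
    missing_reagents = sorted(set(protocol["reagents"]) - set(resources["reagents"]))
    return {
        "budget": DimensionCheck(
            protocol["estimated_cost"] <= resources["budget"],
            f"cost {protocol['estimated_cost']} vs budget {resources['budget']}",
        ),
        "equipment": DimensionCheck(
            not missing_equipment,
            f"missing equipment: {missing_equipment}",
        ),
        "reagent": DimensionCheck(
            not missing_reagents,
            f"missing reagents: {missing_reagents}",
        ),
        "schedule": DimensionCheck(
            protocol["duration_days"] <= resources["max_days"],
            f"{protocol['duration_days']} days vs limit {resources['max_days']}",
        ),
        "staff": DimensionCheck(
            protocol["staff_needed"] <= resources["staff_available"],
            f"{protocol['staff_needed']} needed vs {resources['staff_available']} available",
        ),
    }


def is_feasible(checks: dict) -> bool:
    """A plan is feasible only if every dimension passes."""
    return all(checks[name].ok for name in DIMENSIONS)
```

Because the checks are pure functions of the protocol and the resources, the same inputs always give the same result, which is the property AGT 09 asks the tests to verify.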

README.md CHANGED
@@ -131,6 +131,12 @@ pytest tests/
131
 
132
  RL training improves the Scientist agent's ability to negotiate effective, feasible plans.
133
 
134
  ### Quick Start (Google Colab)
135
 
136
  1. Open `notebooks/train_colab.ipynb` in Google Colab
 
131
 
132
  RL training improves the Scientist agent's ability to negotiate effective, feasible plans.
133
 
134
+ ### Selected base model
135
+
136
+ - **Primary Scientist model:** `Qwen3-4B`
137
+ - **Stretch fallback:** `Qwen3-8B`
138
+ - **Decision record:** `docs/agt11_scientist_model_selection.md`
139
+
140
  ### Quick Start (Google Colab)
141
 
142
  1. Open `notebooks/train_colab.ipynb` in Google Colab
ReplicaLab_Comprehensive_Task_Division.md CHANGED
@@ -295,12 +295,19 @@ Create a stable shared codebase, contracts, and development workflow so all work
295
  - Completed scope for `FND 09`: added `openenv.yaml` with OpenEnv manifest metadata plus the minimal repo wiring required for local OpenEnv validation (`openenv-core` dependency, `server` script entry point, `uv.lock`, and `server.app.main()`)
296
  - Completed scope for `FND 10`: created `replicalab/outputs/` with tracked `logs/`, `replays/`, and `plots/` subdirectories
297
  - Completed scope for `FND 11`: added `server/requirements.txt` with standalone runtime dependency pins and verified installation from that file
298
  - Partial backend scope imported from Max's PR: `server/app.py`, `server/Dockerfile`, and `docs/max/deployment.md` were normalized onto the current standards and validated locally against the stub env
299
- - Remaining work now unblocked by `FND 01`: `FND 03`
300
  - Newly unblocked by `FND 08`: `MOD 01`, `MOD 02`, `MOD 03`, `MOD 12`, `SCN 01`
301
  - Newly unblocked by `FND 06`: `DOC 01`
302
- - Remaining Epic E01 work still gated by follow-on dependencies: `FND 12`, `FND 13`
303
  - Remaining completion items for the imported backend scaffold: real-env integration, Docker validation, and final deployment verification
304
 
305
  ### User stories
306
 
@@ -316,7 +323,7 @@ As a team, we want agreed schemas and coding rules so integration risk stays low
316
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
317
  | FND 01 | E01.1 | Person C | repo root | Create repo structure and base folders from agreed layout | none | 0.5h | all top level folders exist and repo clones cleanly | ✅ Completed | Person B (Ayush) |
318
  | FND 02 | E01.1 | Person C | `pyproject.toml` | Add Python project config and dependencies placeholder | FND 01 | 0.5h | project installs locally without missing package errors for base modules | ✅ Completed | Person B (Ayush) |
319
- | FND 03 | E01.1 | Person C | `frontend/package.json` | Initialize React plus Vite frontend shell | FND 01 | 0.5h | `npm install` and dev server run successfully | Not started | |
320
  | FND 04 | E01.2 | Person A | `replicalab/models.py` | Add empty Pydantic models and shared type names | FND 01 | 0.5h | import paths resolve for all placeholder models | ✅ Completed | Person B (Ayush) |
321
  | FND 05 | E01.2 | Person C | `.gitignore` and `.dockerignore` | Add ignore rules for Python, Node, logs, notebooks, and build artifacts. `.dockerignore` must explicitly exclude `.git`, `node_modules`, `notebooks/`, `tests/`, `__pycache__`, `.venv`, and output files to keep the Docker image lean | FND 01 | 0.25h | repo status stays clean after local run and build, and Docker build excludes non-runtime files | ✅ Completed | Person B (Ayush) |
322
  | FND 06 | E01.2 | Person D | `README.md` | Add temporary project stub with title, mission, team roles, and local setup placeholder | FND 01 | 0.5h | new contributor can understand repo purpose in under two minutes | ✅ Completed | Person B (Ayush) |
@@ -325,7 +332,7 @@ As a team, we want agreed schemas and coding rules so integration risk stays low
325
  | FND 09 | E01.2 | Person A | `openenv.yaml` | Create OpenEnv configuration file specifying environment class, action and observation types, and server settings | FND 04 | 0.5h | OpenEnv can discover and serve the environment using this config file | ✅ Completed | Person B (Ayush) |
326
  | FND 10 | E01.1 | Person C | `replicalab/outputs/` | Create output directory structure with `logs/`, `replays/`, and `plots/` subdirectories and add to gitignore | FND 01 | 0.25h | output directories exist and generated files are not committed to git | ✅ Completed | Person B (Ayush) |
327
  | FND 11 | E01.1 | Person C | `server/requirements.txt` | Create server requirements file pinning FastAPI, uvicorn, websockets, and other runtime dependencies | FND 02 | 0.25h | server can be installed from requirements.txt independently of pyproject.toml | ✅ Completed | Max (Person C) |
328
- | FND 12 | E01.1 | Person C | `frontend/vite.config.ts` | Create Vite config with API and WebSocket proxy support for local development plus stable build output settings | FND 03 | 0.5h | frontend dev server can reach backend without manual URL edits and build output is predictable for Docker packaging | Not started | |
329
  | FND 13 | E01.1 | Person D | `frontend/tailwind.config.ts` and `frontend/postcss.config.js` | Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles | FND 03 | 0.75h | frontend can use Tailwind utilities and shared shadcn compatible theme tokens without CSS pipeline errors | ⬜ Not started | — |
330
 
331
  ---
@@ -343,11 +350,29 @@ Define the environment contracts cleanly so state, actions, and observations are
343
  - `MOD 02` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
344
  - `MOD 03` status: completed on 2026-03-08
345
  - `MOD 03` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
346
  - Completed scope for `MOD 01`: replaced the placeholder `ScientistAction` with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, rejected mixed-mode payloads, added conditional validation for proposal, revision, request-info, and accept modes, added focused schema tests, and patched the stub server so `accept` no longer overwrites the current protocol with default values
347
  - Completed scope for `MOD 02`: replaced the placeholder `LabManagerAction` with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, enforced feasible-flag consistency across budget, equipment, reagent, schedule, and staff checks, rejected suggestion fields outside `suggest_alternative`, and added focused validation tests
348
  - Completed scope for `MOD 03`: introduced typed `ConversationEntry` and `Protocol` models, upgraded both observation branches to use typed nested structures with non-negative numeric constraints and stable keys, and verified dict-to-model coercion through the current stub server and focused tests
349
  - Newly unblocked by `MOD 01`: `MOD 05`, `MOD 09`
350
  - Newly unblocked by `MOD 03`: `MOD 04`, `MOD 11`
351
 
352
  ### User stories
353
 
@@ -364,15 +389,15 @@ As the training loop, I need deterministic state serialization so episodes can b
364
  | MOD 01 | E02.1 | Person A | `replicalab/models.py` | Implement `ScientistAction` schema | FND 08 | 0.5h | valid scientist actions parse and invalid fields raise validation errors | ✅ Completed | Person B (Ayush) |
365
  | MOD 02 | E02.1 | Person A | `replicalab/models.py` | Implement `LabManagerAction` schema | FND 08 | 0.5h | valid lab manager actions parse and invalid fields raise validation errors | ✅ Completed | Person B (Ayush) |
366
  | MOD 03 | E02.1 | Person A | `replicalab/models.py` | Implement role specific `Observation` models | FND 08 | 0.75h | scientist and lab observations serialize to JSON with stable keys | ✅ Completed | Person B (Ayush) |
367
- | MOD 04 | E02.2 | Person A | `replicalab/models.py` | Implement `EpisodeState` and `EpisodeLog` models | MOD 03 | 0.75h | full state round trip serialize plus deserialize works | Not started | |
368
- | MOD 05 | E02.1 | Person A | `replicalab/utils/validation.py` | Add protocol validation for sample size, controls, duration, equipment vocab, reagent vocab | MOD 01 | 1h | invalid protocol examples are rejected with readable reasons | Not started | |
369
  | MOD 06 | E02.1 | Person A | `replicalab/utils/validation.py` | Add semantic validators for impossible plans such as zero sample size with positive controls | MOD 05 | 0.75h | semantic validator catches at least five invalid edge cases | ⬜ Not started | — |
370
  | MOD 07 | E02.2 | Person C | `replicalab/utils/logging.py` | Add state serialization helper for replay logs | MOD 04 | 0.5h | state logs can be written and loaded without loss | ⬜ Not started | — |
371
  | MOD 08 | E02.2 | Person A | tests | Write unit tests for schemas and validators | MOD 01 to MOD 07 | 1h | tests cover valid parse, invalid parse, and replay serialization | ⬜ Not started | — |
372
- | MOD 09 | E02.2 | Person B | `replicalab/agents/scientist_policy.py` | Add output parser that maps model text to `ScientistAction` | MOD 01 | 0.75h | parser returns structured action or explicit parse error | Not started | — |
373
  | MOD 10 | E02.2 | Person C | API docs | Publish schema examples for frontend and notebook clients | MOD 01 to MOD 04 | 0.5h | frontend and notebook can mock against shared sample payloads | ⬜ Not started | — |
374
- | MOD 11 | E02.1 | Person A | `replicalab/models.py` | Implement `StepResult` model with observation, reward, done flag, and info dict | MOD 03 | 0.5h | step result serializes cleanly and all consumers agree on its shape | Not started | |
375
- | MOD 12 | E02.2 | Person A | `replicalab/config.py` | Create environment configuration module with constants for max rounds, default difficulty, timeout duration, max budget, and round time limit | FND 08 | 0.5h | all modules import config from one place and no magic numbers remain in env or scoring code | Not started | |
376
 
377
  ---
378
 
@@ -393,17 +418,17 @@ As a judge, I want normalized constraints and resources so the environment tests
393
 
394
  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
395
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
396
- | SCN 01 | E03.1 | Person A | `replicalab/utils/seed.py` | Implement deterministic RNG helper `seed_rng()` in dedicated seed utility module | FND 08 | 0.5h | same seed always yields the same random choices and seed module is importable from scenarios and env | Not started | |
397
- | SCN 02 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Define normalized scenario schema with task summary, success criteria, constraints, resources, allowed substitutions, and hidden reference spec | MOD 04 | 0.75h | all scenario builders return the same normalized top level structure and mapper-ready inputs | Not started | |
398
- | SCN 03 | E03.2 | Person A | `replicalab/scenarios/math_reasoning.py` | Implement mathematics template with theorem, proof-goal, tool, time, and review constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | Not started | |
399
- | SCN 04 | E03.2 | Person A | `replicalab/scenarios/ml_benchmark.py` | Implement ML benchmark template with dataset, compute, time, and evaluation constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | Not started | |
400
- | SCN 05 | E03.2 | Person A | `replicalab/scenarios/finance_trading.py` | Implement finance and trading planning template with risk, capital, slippage, and backtest constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | Not started | |
401
- | SCN 06 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement difficulty application for easy, medium, hard by mechanically altering constraints, resources, and conflicts | SCN 03 to SCN 05 | 1h | difficulty visibly changes the normalized scenario pack in a meaningful way | Not started | |
402
- | SCN 07 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement normalized constraint and resource generator for budget, time, compute, personnel, stock, and bookings | SCN 02 | 1.25h | no generated scenario contains contradictory constraints or resources | Not started | |
403
- | SCN 08 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement hidden reference spec and allowed substitutions per template | SCN 03 to SCN 05 | 1h | hidden reference clearly marks what is fixed versus flexible for deterministic scoring | Not started | |
404
- | SCN 09 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement `generate_scenario(seed, template, difficulty)` | SCN 01 to SCN 08 | 0.75h | function returns a full scenario with deterministic content | Not started | |
405
- | SCN 10 | E03.1 | Person A | tests | Add seeded generation tests and consistency tests | SCN 09 | 1h | same seed plus template returns same scenario and different seeds vary | Not started | |
406
- | SCN 11 | E03.2 | Person B | fixtures | Create hand checked golden scenarios for prompt testing | SCN 09 | 0.75h | three fixed scenarios are available for deterministic manual testing | Not started | — |
407
  | SCN 12 | E03.2 | Person D | docs | Write plain language scenario summaries for UI examples and README | SCN 03 to SCN 05 | 0.5h | each template has a clean one paragraph explanation for judges | ⬜ Not started | — |
408
  | SCN 13 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time slot conflicts and duration | SCN 07 | 1h | constraint generator can produce realistic booking conflicts across domains and the Lab Manager can check availability | ⬜ Not started | — |
409
 
@@ -426,17 +451,17 @@ As the Lab Manager, I want grounded negotiation plus deterministic feasibility c
426
 
427
  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
428
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
429
- | AGT 01 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Draft domain-neutral system prompt for Scientist role from normalized scenario data | MOD 01, SCN 11 | 0.75h | prompt clearly explains role, mapped constraints, and JSON output contract | Not started | — |
430
- | AGT 02 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build observation to prompt formatting helper from normalized scenario-derived observations | AGT 01, MOD 03 | 0.75h | formatted prompt includes task info, history, and action schema consistently | Not started | — |
431
  | AGT 03 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 | 0.75h | malformed output triggers at least one controlled retry or explicit failure | ⬜ Not started | — |
432
- | AGT 04 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build baseline heuristic Scientist for non trained smoke tests | AGT 02 | 1h | baseline can complete episodes without crashing | Not started | — |
433
- | AGT 05 | E04.2 | Person A and B | `replicalab/agents/lab_manager_policy.py` | Implement deterministic feasibility checker against normalized constraints, resources, schedule, and policy rules | SCN 07, MOD 05 | 1.25h | checker returns clear pass or fail per constraint dimension | Not started | |
434
- | AGT 06 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | AGT 05, SCN 08 | 1h | lab manager can suggest at least one sensible revision when initial plan fails | Not started | — |
435
- | AGT 07 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Add model-backed response synthesis from feasibility results and suggested revisions | AGT 05 | 0.75h | output is readable, grounded in checker results, and maps cleanly to underlying checks | Not started | — |
436
  | AGT 08 | E04.1 | Person B | tests | Add prompt formatting and parse tests for Scientist policy | AGT 01 to AGT 04 | 0.75h | tests cover happy path and malformed output handling | ⬜ Not started | — |
437
  | AGT 09 | E04.2 | Person A | tests | Add deterministic feasibility checker tests for Lab Manager grounding | AGT 05 to AGT 07 | 0.75h | same proposal plus same normalized scenario returns the same checker results every time | ⬜ Not started | — |
438
  | AGT 10 | E04.1 | Person B | `replicalab/prompts/` | Write prompt text files for all three roles: `scientist.txt`, `lab_manager.txt`, `judge.txt` | AGT 01, AGT 07, JDG 06 | 0.75h | prompt files exist, are loadable, and assemble correctly from normalized scenario data and agreed role behavior | ⬜ Not started | — |
439
- | AGT 11 | E04.1 | Person B | docs | Select and document base model for Scientist training with rationale for model size, license, and structured output capability | AGT 01 | 0.5h | decision is recorded and all team members know which model will be fine tuned | Not started | — |
440
 
441
  ---
442
 
 
295
  - Completed scope for `FND 09`: added `openenv.yaml` with OpenEnv manifest metadata plus the minimal repo wiring required for local OpenEnv validation (`openenv-core` dependency, `server` script entry point, `uv.lock`, and `server.app.main()`)
296
  - Completed scope for `FND 10`: created `replicalab/outputs/` with tracked `logs/`, `replays/`, and `plots/` subdirectories
297
  - Completed scope for `FND 11`: added `server/requirements.txt` with standalone runtime dependency pins and verified installation from that file
298
+ - Completed scope for `FND 03`: imported the full React plus Vite frontend tree from Kush's branch onto `ayush`, including the app shell, pages, shared components, assets, and TypeScript config, and validated it with `npm --prefix frontend install` plus `npm --prefix frontend run build`
299
+ - Completed scope for `FND 12`: imported `frontend/vite.config.ts` with local `/api` and `/ws` proxy support plus stable Vite build settings and validated the build on `ayush`
300
  - Partial backend scope imported from Max's PR: `server/app.py`, `server/Dockerfile`, and `docs/max/deployment.md` were normalized onto the current standards and validated locally against the stub env
 
301
  - Newly unblocked by `FND 08`: `MOD 01`, `MOD 02`, `MOD 03`, `MOD 12`, `SCN 01`
302
  - Newly unblocked by `FND 06`: `DOC 01`
303
+ - Newly unblocked by `FND 03`: `FND 13`, `UI 01`
304
+ - Remaining Epic E01 work still gated by follow-on dependencies: `FND 13`
305
  - Remaining completion items for the imported backend scaffold: real-env integration, Docker validation, and final deployment verification
306
+ - Completed scope for `SCN 01` to `SCN 10`: added deterministic seed utilities, normalized scenario-pack models, math / ML / finance template builders, difficulty scaling, hidden reference specs, allowed substitutions, and seeded scenario tests
307
+ - Completed scope for `SCN 11`: added three fixed golden scenarios for deterministic prompt and manual checks under `tests/fixtures/golden_scenarios.json`
308
+ - Completed scope for `AGT 01`: added a domain-neutral Scientist system prompt builder that renders role instructions, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON output contract from normalized scenario data
309
+ - Newly unblocked by `SCN 11` and `AGT 01`: `AGT 02`, `AGT 11`, `TRN 04`, `TRN 08`
310
+ - Remaining Epic E03 work after the scenario bundle: `SCN 12`, `SCN 13`
311
 
312
  ### User stories
313
 
 
323
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
324
  | FND 01 | E01.1 | Person C | repo root | Create repo structure and base folders from agreed layout | none | 0.5h | all top level folders exist and repo clones cleanly | ✅ Completed | Person B (Ayush) |
325
  | FND 02 | E01.1 | Person C | `pyproject.toml` | Add Python project config and dependencies placeholder | FND 01 | 0.5h | project installs locally without missing package errors for base modules | ✅ Completed | Person B (Ayush) |
326
+ | FND 03 | E01.1 | Person C | `frontend/package.json` | Initialize React plus Vite frontend shell | FND 01 | 0.5h | `npm install` and dev server run successfully | Completed | Kush |
327
  | FND 04 | E01.2 | Person A | `replicalab/models.py` | Add empty Pydantic models and shared type names | FND 01 | 0.5h | import paths resolve for all placeholder models | ✅ Completed | Person B (Ayush) |
328
  | FND 05 | E01.2 | Person C | `.gitignore` and `.dockerignore` | Add ignore rules for Python, Node, logs, notebooks, and build artifacts. `.dockerignore` must explicitly exclude `.git`, `node_modules`, `notebooks/`, `tests/`, `__pycache__`, `.venv`, and output files to keep the Docker image lean | FND 01 | 0.25h | repo status stays clean after local run and build, and Docker build excludes non-runtime files | ✅ Completed | Person B (Ayush) |
329
  | FND 06 | E01.2 | Person D | `README.md` | Add temporary project stub with title, mission, team roles, and local setup placeholder | FND 01 | 0.5h | new contributor can understand repo purpose in under two minutes | ✅ Completed | Person B (Ayush) |
 
332
  | FND 09 | E01.2 | Person A | `openenv.yaml` | Create OpenEnv configuration file specifying environment class, action and observation types, and server settings | FND 04 | 0.5h | OpenEnv can discover and serve the environment using this config file | ✅ Completed | Person B (Ayush) |
333
  | FND 10 | E01.1 | Person C | `replicalab/outputs/` | Create output directory structure with `logs/`, `replays/`, and `plots/` subdirectories and add to gitignore | FND 01 | 0.25h | output directories exist and generated files are not committed to git | ✅ Completed | Person B (Ayush) |
334
  | FND 11 | E01.1 | Person C | `server/requirements.txt` | Create server requirements file pinning FastAPI, uvicorn, websockets, and other runtime dependencies | FND 02 | 0.25h | server can be installed from requirements.txt independently of pyproject.toml | ✅ Completed | Max (Person C) |
335
+ | FND 12 | E01.1 | Person C | `frontend/vite.config.ts` | Create Vite config with API and WebSocket proxy support for local development plus stable build output settings | FND 03 | 0.5h | frontend dev server can reach backend without manual URL edits and build output is predictable for Docker packaging | Completed | Kush |
336
  | FND 13 | E01.1 | Person D | `frontend/tailwind.config.ts` and `frontend/postcss.config.js` | Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles | FND 03 | 0.75h | frontend can use Tailwind utilities and shared shadcn compatible theme tokens without CSS pipeline errors | ⬜ Not started | — |
337
 
338
  ---
 
350
  - `MOD 02` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
351
  - `MOD 03` status: completed on 2026-03-08
352
  - `MOD 03` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
353
+ - `MOD 04` status: completed on 2026-03-08
354
+ - `MOD 04` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
355
+ - `MOD 05` status: completed on 2026-03-08
356
+ - `MOD 05` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
357
+ - `MOD 11` status: completed on 2026-03-08
358
+ - `MOD 11` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
359
+ - `MOD 12` status: completed on 2026-03-08
360
+ - `MOD 12` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
361
+ - `MOD 09` status: completed on 2026-03-08
362
  - Completed scope for `MOD 01`: replaced the placeholder `ScientistAction` with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, rejected mixed-mode payloads, added conditional validation for proposal, revision, request-info, and accept modes, added focused schema tests, and patched the stub server so `accept` no longer overwrites the current protocol with default values
363
  - Completed scope for `MOD 02`: replaced the placeholder `LabManagerAction` with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, enforced feasible-flag consistency across budget, equipment, reagent, schedule, and staff checks, rejected suggestion fields outside `suggest_alternative`, and added focused validation tests
364
  - Completed scope for `MOD 03`: introduced typed `ConversationEntry` and `Protocol` models, upgraded both observation branches to use typed nested structures with non-negative numeric constraints and stable keys, and verified dict-to-model coercion through the current stub server and focused tests
365
+ - Completed scope for `MOD 04`: replaced the remaining loose `dict` state and replay fields with typed `Protocol`, `ConversationEntry`, and `RewardBreakdown` models, updated the stub runtime to construct those nested models explicitly, and added round-trip coverage for serialized state and logs
366
+ - Completed scope for `MOD 05`: added deterministic semantic protocol validation in `replicalab/utils/validation.py` with `ValidationResult` and `validate_protocol(...)` checks for resource vocabulary, allowed substitutions, duration limits, required-element coverage, and obvious impossibilities against the normalized scenario pack
367
+ - Completed scope for `MOD 11`: introduced typed `RewardBreakdown` and `StepInfo` models, upgraded `StepResult.info` to the reserved-key contract while still allowing debug metadata, and updated the stub runtime to build typed reward and step-info payloads explicitly
368
+ - Completed scope for `MOD 12`: added `replicalab/config.py` as the shared constants module for default scenario, difficulty, round cap, budget cap, timeout values, stub reward, and API host or port defaults; updated the server and scenario builders to import those constants instead of repeating magic numbers
369
+ - Completed scope for `MOD 09`: added `replicalab/agents/scientist_policy.py` with a raw-text parser that extracts JSON from plain text or fenced blocks, validates it into `ScientistAction`, and raises an explicit `ScientistOutputParseError` for missing JSON, invalid JSON, or schema failures; added focused parser tests and package exports
370
  - Newly unblocked by `MOD 01`: `MOD 05`, `MOD 09`
371
  - Newly unblocked by `MOD 03`: `MOD 04`, `MOD 11`
372
+ - Newly unblocked by `MOD 04`: `MOD 07`, `ENV 01`
373
+ - Newly unblocked by `MOD 05`: `MOD 06`, `AGT 05`
374
+ - `MOD 11` does not introduce a new formal dependency edge by itself, but it stabilizes `StepResult` metadata for environment, API, replay, and training consumers
375
+ - `MOD 09` does not fully unblock a new task by itself, but it removes one half of the blocker on `AGT 03`; `AGT 03` now only waits on `AGT 02`
376
 
377
  ### User stories
378
 
 
389
  | MOD 01 | E02.1 | Person A | `replicalab/models.py` | Implement `ScientistAction` schema | FND 08 | 0.5h | valid scientist actions parse and invalid fields raise validation errors | ✅ Completed | Person B (Ayush) |
390
  | MOD 02 | E02.1 | Person A | `replicalab/models.py` | Implement `LabManagerAction` schema | FND 08 | 0.5h | valid lab manager actions parse and invalid fields raise validation errors | ✅ Completed | Person B (Ayush) |
391
  | MOD 03 | E02.1 | Person A | `replicalab/models.py` | Implement role specific `Observation` models | FND 08 | 0.75h | scientist and lab observations serialize to JSON with stable keys | ✅ Completed | Person B (Ayush) |
392
+ | MOD 04 | E02.2 | Person A | `replicalab/models.py` | Implement `EpisodeState` and `EpisodeLog` models | MOD 03 | 0.75h | full state round trip serialize plus deserialize works | Completed | Person B (Ayush) |
393
+ | MOD 05 | E02.1 | Person A | `replicalab/utils/validation.py` | Add protocol validation for sample size, controls, duration, equipment vocab, reagent vocab | MOD 01 | 1h | invalid protocol examples are rejected with readable reasons | Completed | Person B (Ayush) |
394
  | MOD 06 | E02.1 | Person A | `replicalab/utils/validation.py` | Add semantic validators for impossible plans such as zero sample size with positive controls | MOD 05 | 0.75h | semantic validator catches at least five invalid edge cases | ⬜ Not started | — |
395
  | MOD 07 | E02.2 | Person C | `replicalab/utils/logging.py` | Add state serialization helper for replay logs | MOD 04 | 0.5h | state logs can be written and loaded without loss | ⬜ Not started | — |
396
  | MOD 08 | E02.2 | Person A | tests | Write unit tests for schemas and validators | MOD 01 to MOD 07 | 1h | tests cover valid parse, invalid parse, and replay serialization | ⬜ Not started | — |
397
+ | MOD 09 | E02.2 | Person B | `replicalab/agents/scientist_policy.py` | Add output parser that maps model text to `ScientistAction` | MOD 01 | 0.75h | parser returns structured action or explicit parse error | Completed | — |
398
  | MOD 10 | E02.2 | Person C | API docs | Publish schema examples for frontend and notebook clients | MOD 01 to MOD 04 | 0.5h | frontend and notebook can mock against shared sample payloads | ⬜ Not started | — |
399
+ | MOD 11 | E02.1 | Person A | `replicalab/models.py` | Implement `StepResult` model with observation, reward, done flag, and info dict | MOD 03 | 0.5h | step result serializes cleanly and all consumers agree on its shape | Completed | Person B (Ayush) |
400
+ | MOD 12 | E02.2 | Person A | `replicalab/config.py` | Create environment configuration module with constants for max rounds, default difficulty, timeout duration, max budget, and round time limit | FND 08 | 0.5h | all modules import config from one place and no magic numbers remain in env or scoring code | Completed | Person B (Ayush) |
401
 
402
  ---
403
 
 
418
 
419
  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
420
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
421
+ | SCN 01 | E03.1 | Person A | `replicalab/utils/seed.py` | Implement deterministic RNG helper `seed_rng()` in dedicated seed utility module | FND 08 | 0.5h | same seed always yields the same random choices and seed module is importable from scenarios and env | Completed | Person B (Ayush) |
422
+ | SCN 02 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Define normalized scenario schema with task summary, success criteria, constraints, resources, allowed substitutions, and hidden reference spec | MOD 04 | 0.75h | all scenario builders return the same normalized top level structure and mapper-ready inputs | Completed | Person B (Ayush) |
423
+ | SCN 03 | E03.2 | Person A | `replicalab/scenarios/math_reasoning.py` | Implement mathematics template with theorem, proof-goal, tool, time, and review constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | Completed | Person B (Ayush) |
424
+ | SCN 04 | E03.2 | Person A | `replicalab/scenarios/ml_benchmark.py` | Implement ML benchmark template with dataset, compute, time, and evaluation constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | Completed | Person B (Ayush) |
425
+ | SCN 05 | E03.2 | Person A | `replicalab/scenarios/finance_trading.py` | Implement finance and trading planning template with risk, capital, slippage, and backtest constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | Completed | Person B (Ayush) |
426
+ | SCN 06 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement difficulty application for easy, medium, hard by mechanically altering constraints, resources, and conflicts | SCN 03 to SCN 05 | 1h | difficulty visibly changes the normalized scenario pack in a meaningful way | Completed | Person B (Ayush) |
427
+ | SCN 07 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement normalized constraint and resource generator for budget, time, compute, personnel, stock, and bookings | SCN 02 | 1.25h | no generated scenario contains contradictory constraints or resources | Completed | Person B (Ayush) |
428
+ | SCN 08 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement hidden reference spec and allowed substitutions per template | SCN 03 to SCN 05 | 1h | hidden reference clearly marks what is fixed versus flexible for deterministic scoring | Completed | Person B (Ayush) |
429
+ | SCN 09 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement `generate_scenario(seed, template, difficulty)` | SCN 01 to SCN 08 | 0.75h | function returns a full scenario with deterministic content | Completed | Person B (Ayush) |
430
+ | SCN 10 | E03.1 | Person A | tests | Add seeded generation tests and consistency tests | SCN 09 | 1h | same seed plus template returns same scenario and different seeds vary | Completed | Person B (Ayush) |
431
+ | SCN 11 | E03.2 | Person B | fixtures | Create hand checked golden scenarios for prompt testing | SCN 09 | 0.75h | three fixed scenarios are available for deterministic manual testing | Completed | — |
432
  | SCN 12 | E03.2 | Person D | docs | Write plain language scenario summaries for UI examples and README | SCN 03 to SCN 05 | 0.5h | each template has a clean one paragraph explanation for judges | ⬜ Not started | — |
433
  | SCN 13 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time slot conflicts and duration | SCN 07 | 1h | constraint generator can produce realistic booking conflicts across domains and the Lab Manager can check availability | ⬜ Not started | — |
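
The determinism required by `SCN 01`, `SCN 09`, and `SCN 10` in the table above can be illustrated with a minimal sketch. It assumes `seed_rng()` wraps Python's `random.Random`; the real signatures in `replicalab/utils/seed.py` and `replicalab/scenarios/templates.py` are not shown in this diff, so the function bodies, the returned fields, and the `DIFFICULTY_BUDGET_SCALE` values below are hypothetical.

```python
import random

# Hypothetical scaling factors: harder difficulties shrink the budget.
DIFFICULTY_BUDGET_SCALE = {"easy": 1.0, "medium": 0.75, "hard": 0.5}


def seed_rng(seed: int) -> random.Random:
    """Return an isolated RNG so generation never touches global random state."""
    return random.Random(seed)


def generate_scenario(seed: int, template: str, difficulty: str) -> dict:
    """Build a toy scenario whose content depends only on the three arguments."""
    rng = seed_rng(seed)
    base_budget = rng.randint(1_000, 5_000)
    time_limit_days = rng.randint(3, 14)
    return {
        "template": template,
        "difficulty": difficulty,
        "budget": int(base_budget * DIFFICULTY_BUDGET_SCALE[difficulty]),
        "time_limit_days": time_limit_days,
    }


first = generate_scenario(7, "ml_benchmark", "easy")
second = generate_scenario(7, "ml_benchmark", "easy")
print(first == second)  # same seed, template, and difficulty give the same scenario
```

Using a per-call `random.Random` instance rather than `random.seed()` keeps scenario generation independent of any other code that draws from the global generator.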
434
 
 
451
 
452
  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
453
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
454
+ | AGT 01 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Draft domain-neutral system prompt for Scientist role from normalized scenario data | MOD 01, SCN 11 | 0.75h | prompt clearly explains role, mapped constraints, and JSON output contract | Completed | — |
455
+ | AGT 02 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build observation to prompt formatting helper from normalized scenario-derived observations | AGT 01, MOD 03 | 0.75h | formatted prompt includes task info, history, and action schema consistently | Completed | — |
456
  | AGT 03 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 | 0.75h | malformed output triggers at least one controlled retry or explicit failure | ⬜ Not started | — |
457
+ | AGT 04 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build baseline heuristic Scientist for non trained smoke tests | AGT 02 | 1h | baseline can complete episodes without crashing | Completed | — |
458
+ | AGT 05 | E04.2 | Person A and B | `replicalab/agents/lab_manager_policy.py` | Implement deterministic feasibility checker against normalized constraints, resources, schedule, and policy rules | SCN 07, MOD 05 | 1.25h | checker returns clear pass or fail per constraint dimension | Completed | Person B (Ayush) |
459
+ | AGT 06 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | AGT 05, SCN 08 | 1h | lab manager can suggest at least one sensible revision when initial plan fails | Completed | — |
460
+ | AGT 07 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Add model-backed response synthesis from feasibility results and suggested revisions | AGT 05 | 0.75h | output is readable, grounded in checker results, and maps cleanly to underlying checks | Completed | — |
461
  | AGT 08 | E04.1 | Person B | tests | Add prompt formatting and parse tests for Scientist policy | AGT 01 to AGT 04 | 0.75h | tests cover happy path and malformed output handling | ⬜ Not started | — |
462
  | AGT 09 | E04.2 | Person A | tests | Add deterministic feasibility checker tests for Lab Manager grounding | AGT 05 to AGT 07 | 0.75h | same proposal plus same normalized scenario returns the same checker results every time | ⬜ Not started | — |
463
  | AGT 10 | E04.1 | Person B | `replicalab/prompts/` | Write prompt text files for all three roles: `scientist.txt`, `lab_manager.txt`, `judge.txt` | AGT 01, AGT 07, JDG 06 | 0.75h | prompt files exist, are loadable, and assemble correctly from normalized scenario data and agreed role behavior | ⬜ Not started | — |
464
+ | AGT 11 | E04.1 | Person B | docs | Select and document base model for Scientist training with rationale for model size, license, and structured output capability | AGT 01 | 0.5h | decision is recorded and all team members know which model will be fine tuned | Completed | — |
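
The `AGT 05` acceptance criterion above ("checker returns clear pass or fail per constraint dimension") can be sketched as follows. This is an illustration only: the commit names a `check_feasibility` function, but its real signature is not visible here, so the `DimensionCheck` type, the flat `plan` and `limits` dictionaries, and the dimension names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionCheck:
    """Outcome of checking one constraint dimension (hypothetical type)."""
    dimension: str
    passed: bool
    reason: str


def check_feasibility(plan: dict, limits: dict) -> list[DimensionCheck]:
    """Compare each requested quantity with its limit, one result per dimension."""
    results = []
    for dimension in sorted(limits):  # sorted so the output order is stable
        requested = plan.get(dimension, 0)
        allowed = limits[dimension]
        results.append(
            DimensionCheck(
                dimension=dimension,
                passed=requested <= allowed,
                reason=f"{dimension}: requested {requested}, allowed {allowed}",
            )
        )
    return results


for check in check_feasibility(
    {"budget": 1200, "gpu_hours": 8},
    {"budget": 1000, "gpu_hours": 10},
):
    print("PASS" if check.passed else "FAIL", check.reason)
```

Because the function has no randomness and no hidden state, the same proposal and limits always produce the same results, which is the property `AGT 09` is meant to test.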
465
 
466
  ---
467
 
docs/agt11_scientist_model_selection.md ADDED
@@ -0,0 +1,45 @@
+ # AGT 11 Scientist Model Selection
+
+ ## Decision
+
+ The primary Scientist training model is **Qwen3-4B**.
+
+ The stretch fallback is **Qwen3-8B** if H100-only training is acceptable and
+ the 4B model underperforms on structured planning quality.
+
+ ## Why Qwen3-4B
+
+ - Strong enough for structured JSON action output without moving to a much
+ slower large-model loop.
+ - Small enough for fast RL iteration on H100 and practical 4-bit Colab use.
+ - Open weights with a permissive Apache 2.0 license.
+ - A clean fit for the current architecture: train the Scientist first while the
+ reward and Lab Manager grounding remain deterministic.
+
+ ## Why Not Smaller
+
+ - Smaller checkpoints are cheaper, but they are more likely to underperform on
+ multi-step technical planning and strict output schemas.
+ - The project needs enough reasoning quality to negotiate across mathematics,
+ machine learning, and finance-trading scenarios.
+
+ ## Why Not Larger By Default
+
+ - Larger checkpoints slow rollout collection and raise memory pressure.
+ - The judged artifact still needs a credible Colab path, not only an H100-only
+ path.
+ - Faster iteration matters more than squeezing out a marginal quality gain at
+ this stage.
+
+ ## Project Usage
+
+ - **Scientist MVP training:** `Qwen3-4B`
+ - **Stretch Scientist training:** `Qwen3-8B`
+ - **Lab Manager future path:** reuse the same base family with a separate
+ role-specific adapter if the team later trains both roles
+
+ ## Notes
+
+ - The reward loop stays deterministic regardless of the model choice.
+ - `TRN 14` should mirror this decision on the notebook side once the Colab
+ skeleton exists.
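
The "practical 4-bit Colab use" argument in the model selection note can be sanity-checked with a back-of-the-envelope estimate of weight memory. This is a rough sketch under stated assumptions: it takes nominal parameter counts of 4 and 8 billion from the model names, counts weights only, and ignores activations, KV cache, optimizer state, and quantization overhead, all of which add to the real footprint.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Nominal parameter counts assumed from the model names, not measured.
for name, params in [("Qwen3-4B", 4e9), ("Qwen3-8B", 8e9)]:
    for bits in (16, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gib(params, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights of the smaller model take roughly a quarter of the memory of the 16-bit weights, which is the gap that makes a Colab path plausible.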
docs/ayush/task_breakdown.md CHANGED
@@ -9,185 +9,187 @@ No assumptions from other documents are used to reclassify blocked status.
9
 
10
  ## 1. Blocking Status
11
 
12
- Per the source of truth, `FND 08` is now complete and `FND 09` has landed in
13
- `openenv.yaml` with OpenEnv-compatible runtime wiring in the repo.
14
- `MOD 01` and `MOD 03` are now complete, so `MOD 09` is immediately unblocked
15
- and the observation side of `AGT 02` is no longer waiting on Person A.
16
- The next Ayush-owned tasks after `MOD 09` are still gated by Kian's remaining
17
- scenario and validation deliverables, starting with `SCN 09` and `MOD 05`.
18
- The prompt and Lab Manager workstream now assumes a normalized scenario pack
19
- below the stable outer contract, so Ayush-owned prompting should be assembled
20
- from mapped scenario data rather than hard-coded to one domain.
 
 
21
 
22
  ---
23
 
24
- ## 2. Blocked by Kian (Person A)-Led External Dependencies
25
 
26
- These tasks are first gated by upstream deliverables, primarily from Kian (Person A).
27
- `JDG 10` also requires Max (Person C) to ship `JDG 07`.
28
 
29
- | ID | Task | Depends On | Person A Deliverable | Est |
30
- |----|------|-----------|---------------------|-----|
31
- | AGT 01 | Draft domain-neutral Scientist system prompt | MOD 01, SCN 11 | ScientistAction schema + generate_scenario | 0.75h |
32
- | AGT 05 | Implement deterministic feasibility checker (shared A+B) | SCN 07, MOD 05 | Constraint generator + validation | 1.25h |
33
- | SCN 11 | Create golden scenarios for prompt testing | SCN 09 | generate_scenario() | 0.75h |
34
- | JDG 10 | Expose component metrics for training plots | JDG 05, JDG 07 | Reward breakdown (A) + logging (C) | 0.5h |
35
 
36
- **Total: 4 tasks, 3.25h**
37
 
38
- ### What to ask Kian for first (priority order)
 
 
39
 
40
- 1. **SCN 09** (generate normalized scenario packs) -- unblocks SCN 11 and then AGT 01
41
- 2. **SCN 07 + MOD 05** (normalized constraints/resources + validation) -- unblocks AGT 05, AGT 06, AGT 07
42
- 3. **JDG 05 + JDG 06** (reward breakdown + explanation) -- unblocks AGT 10 and is only part of the path for JDG 10
43
- 4. **SCN 08** (minimum viable replication spec) -- unblocks AGT 06 after AGT 05
44
 
45
  ---
46
 
47
- ## 3. Blocked by Person A Then Person B Internal Chain
48
 
49
- These depend on Person A deliverables AND on earlier Person B tasks. They unblock
50
- sequentially as both streams deliver.
51
 
52
  | ID | Task | Depends On | Blocked By | Est |
53
  |----|------|-----------|-----------|-----|
54
- | AGT 02 | Observation to prompt formatting helper | AGT 01 (B) + MOD 03 (A) | Person B: AGT 01 | 0.75h |
55
- | AGT 03 | Parse plus retry for malformed output | MOD 09 (B) + AGT 02 (B) | Person B: MOD 09, AGT 02 | 0.75h |
56
- | AGT 04 | Baseline heuristic Scientist | AGT 02 (B) | Person B: AGT 02 | 1h |
57
- | AGT 06 | Alternative suggestion logic from allowed substitutions | AGT 05 (A+B), SCN 08 (A) | Person A: SCN 08, Person A+B: AGT 05 | 1h |
58
- | AGT 07 | Model-backed Lab Manager response synthesis | AGT 05 (A+B) | Person A+B: AGT 05 | 0.75h |
59
- | AGT 08 | Prompt formatting and parse tests | AGT 01 to AGT 04 (B) | Person B: AGT 01-04 | 0.75h |
60
- | AGT 10 | Write domain-neutral prompt text files for all 3 roles | AGT 01 (B) + AGT 07 (B) + JDG 06 (A) | Person A: JDG 06, Person B: AGT 01, AGT 07 | 0.75h |
61
- | AGT 11 | Select and document base model | AGT 01 (B) | Person B: AGT 01 | 0.5h |
62
 
63
- **Total: 8 tasks, 6.25h**
64
 
65
  ---
66
 
67
- ## 4. Blocked by Max (Person C) (Needs Server/API)
68
 
69
- Cannot proceed until Person C delivers the server and deployment.
70
 
71
- | ID | Task | Depends On | Person C Deliverable | Est |
72
- |----|------|-----------|---------------------|-----|
73
  | TRN 01 | Notebook skeleton | API 10 | Deployed HF Space | 0.5h |
74
- | TRN 03 | Env client wrapper in notebook | API 06 | WebSocket handler | 1h |
75
- | TRN 13 | client.py reusable module | API 06 | WebSocket handler | 1h |
76
 
77
  **Total: 3 tasks, 2.5h**
78
 
79
- ### What to ask Max for first (priority order)
80
 
81
- 1. **API 06** (WebSocket handler) -- unblocks TRN 03 and TRN 13
82
- 2. **API 10** (deployed HF Space) -- unblocks TRN 01 notebook skeleton
83
 
84
  ---
85
 
86
- ## 5. Deep Training Chain (Sequential After Upstream Deliverables)
87
 
88
- These execute in strict order once Person A, Person C, and earlier Person B tasks
89
  are done.
90
 
91
  | Order | ID | Task | Depends On | Est |
92
  |-------|----|------|-----------|-----|
93
- | 1 | TRN 02 | Package install and model setup cell | TRN 01 (B) | 0.75h |
94
- | 2 | TRN 14 | Select and document base model (notebook side) | TRN 01 (B) | 0.5h |
95
- | 3 | TRN 04 | Rollout collection loop | TRN 03 (B), AGT 01 (B) | 1h |
96
- | 4 | TRN 05 | Connect rollouts to GRPO trainer | TRN 04 (B) | 1.25h |
97
- | 5 | TRN 06 | Log episode metrics | JDG 10 (B), TRN 04 (B) | 0.75h |
98
- | 6 | TRN 07 | Plot reward curves | TRN 06 (B) | 0.5h |
99
- | 7 | TRN 08 | Before vs after eval on fixed seeds | SCN 11 (B), TRN 05 (B) | 1h |
100
- | 8 | TRN 09 | Policy loading for trained checkpoint | TRN 05 (B) | 0.5h |
101
- | 9 | TRN 10 | Export plots to outputs/plots | TRN 07 (B) | 0.25h |
102
- | 10 | TRN 15 | Agreement and invalid action rate aggregation | TRN 06 (B), TRN 08 (B), OBS 09 (C) | 0.5h |
103
- | 11 | OBS 06 | Log training run metadata | TRN 06 (B) | 0.5h |
104
 
105
  **Total: 11 tasks, 7.5h**
106
 
107
  ---
108
 
109
- ## 6. Blocked by Kush (Person D)
110
 
111
- | ID | Task | Depends On | Person D Deliverable | Est |
112
- |----|------|-----------|---------------------|-----|
113
- | TST 09 | Notebook smoke test for fresh runtime | TRN 12 | Evaluation storytelling insights | 0.5h |
114
 
115
  **Total: 1 task, 0.5h**
116
 
117
  ---
118
 
119
- ## 7. Recommended Execution Order
120
-
121
- All phases are gated by the listed external dependency being delivered first.
122
 
123
- ### Phase 1: Active now
124
 
125
- 1. **FND 08** -- Completed and signed off
126
- 2. **FND 09** -- Completed in `openenv.yaml`
127
- 3. **MOD 09** -- Build output parser for ScientistAction
 
 
128
 
129
- ### Phase 2: After Kian delivers SCN 09
130
 
131
- 4. **SCN 11** -- Create golden scenarios for prompt testing
132
- 5. **AGT 01** -- Draft domain-neutral Scientist system prompt
133
- 6. **AGT 11** -- Select and document base model
134
 
135
- ### Phase 3: After AGT 01
136
 
137
- 7. **AGT 02** -- Build observation to prompt formatter
138
- 8. **AGT 03** -- Add parse plus retry logic
139
- 9. **AGT 04** -- Build baseline heuristic Scientist
140
- 10. **AGT 08** -- Write tests for prompt formatting and parsing
141
 
142
- ### Phase 4: After Kian delivers SCN 07 + MOD 05 + SCN 08 + JDG 05 + JDG 06, and Max delivers JDG 07
143
 
144
- 11. **AGT 05** -- Deterministic feasibility checker (shared with Person A)
145
- 12. **AGT 06** -- Alternative suggestion logic from allowed substitutions
146
- 13. **AGT 07** -- Model-backed Lab Manager response synthesis
147
- 14. **AGT 10** -- Write all domain-neutral prompt text files
148
- 15. **JDG 10** -- Expose component metrics for training plots
149
 
150
- ### Phase 5: After Max delivers API 06 + API 10
151
 
152
- 16. **TRN 13** -- Build client.py reusable module
153
- 17. **TRN 01** -- Create notebook skeleton
154
- 18. **TRN 02** -- Package install and model setup cell
155
- 19. **TRN 03** -- Environment client wrapper in notebook
156
- 20. **TRN 14** -- Document base model choice (notebook side)
157
 
158
- ### Phase 6: Training Pipeline (internal chain)
159
 
160
- 21. **TRN 04** -- Rollout collection loop
161
- 22. **TRN 05** -- Connect to GRPO trainer
162
- 23. **TRN 06** -- Log episode metrics
163
- 24. **TRN 07** -- Plot reward curves
164
- 25. **TRN 08** -- Before vs after evaluation
165
- 26. **TRN 09** -- Policy loading for checkpoints
166
- 27. **TRN 10** -- Export plots
167
- 28. **TRN 15** -- Agreement and invalid action rate metrics
168
- 29. **OBS 06** -- Training run metadata logging
169
 
170
- ### Phase 7: After Kush delivers TRN 12
171
 
172
- 30. **TST 09** -- Notebook smoke test
173
 
174
  ---
175
 
176
- ## 8. Summary Table
177
 
178
  | Category | Count | Hours |
179
  |----------|-------|-------|
180
  | Active now | 1 | 0.75h |
181
- | Blocked by Person A (first-order) | 4 | 3.25h |
182
- | Blocked by Person A then Person B chain | 8 | 6.25h |
183
- | Blocked by Person C | 3 | 2.5h |
184
- | Deep training chain (internal) | 11 | 7.5h |
185
- | Blocked by Person D | 1 | 0.5h |
186
- | **Total** | **29** | **21.5h** |
 
187
 
188
  ---
189
 
190
- ## 9. Base Model Assumptions
191
 
192
  ### Trainable Scientist policy
193
 
@@ -212,7 +214,8 @@ of the training reward loop.
212
 
213
  ### Hybrid Lab Manager
214
 
215
- The MVP Lab Manager path is now hybrid:
 
216
  - A deterministic feasibility checker remains the source of truth for
217
  `feasible`, constraint flags, and any final structured `LabManagerAction`.
218
  - Model-backed response generation is used for negotiation language and
@@ -220,14 +223,15 @@ The MVP Lab Manager path is now hybrid:
220
  - The reward formula does not change. The deterministic rubric scores the final
221
  plan against the hidden reference spec regardless of how the Lab Manager
222
  generates its language.
223
- - Reward does not split into separate Scientist vs Lab Manager objectives.
224
  Both roles share the same cooperative reward signal.
225
  - If the team later shares one base model across both roles, the pragmatic
226
- default is one base model (Qwen3-4B) with separate role-specific adapters.
227
 
228
  ### Prompt assembly
229
 
230
  Ayush-owned prompts should be assembled from normalized scenario data:
 
231
  - `task_summary`
232
  - `success_criteria`
233
  - `constraints`
@@ -240,20 +244,7 @@ physics, or biology.
240
 
241
  ---
242
 
243
- ## 10. Key Risks for Person B
244
-
245
- | Risk | Impact | Mitigation |
246
- |------|--------|------------|
247
- | Person A SCN 09 or MOD 05 delayed | Blocks AGT 01 via SCN 11 and delays AGT 05-07 plus downstream work | Communicate priority order to Person A early |
248
- | Person C API delayed | Blocks entire training pipeline (TRN 01-15) | Coordinate with Person C on API 06 timeline |
249
- | Qwen3-4B underperforms on structured output | Scientist produces low quality protocols | Fall back to Qwen3-8B on H100, use reduced-scale Colab fallback |
250
- | RL training produces flat rewards | No improvement to demo | Have baseline heuristic ready, tune reward weights with Person A |
251
- | Scientist produces invalid JSON | Rollout loop crashes | AGT 03 parse plus retry is critical, build it robust |
252
- | Hybrid Lab Manager increases variance if generation settings are too loose | Slower RL convergence | Keep checker as source of truth, use low-variance generation or frozen manager weights during Scientist training |
253
-
254
- ---
255
-
256
- ## 11. Files Person B Owns
257
 
258
  | File | Purpose |
259
  |------|---------|
 
9
 
10
  ## 1. Blocking Status
11
 
12
+ `FND 08`, `FND 09`, `MOD 09`, `SCN 11`, and `AGT 01` are now complete.
13
+ The scenario prerequisite bundle (`SCN 01` to `SCN 10`) also exists in the
14
+ repo, so Ayush no longer waits on `SCN 09` to start prompt-layer work.
15
+
16
+ Ayush now has one fully unblocked task:
17
+
18
+ 1. `AGT 03` -- highest-leverage next task inside the Scientist chain
19
+
20
+ The prompt and Lab Manager workstream continues to assume a normalized scenario
21
+ pack below the stable outer contract, so Ayush-owned prompting should be
22
+ assembled from mapped scenario data rather than hard-coded to one domain.
23
 
24
  ---
25
 
26
+ ## 2. Active Now
27
 
28
+ | ID | Task | Depends On | Why It Is Ready | Est |
29
+ |----|------|-----------|-----------------|-----|
30
+ | AGT 03 | Parse plus retry for malformed output | MOD 09, AGT 02 | The parser and observation formatter are now both complete | 0.75h |
31
+
32
+ **Total: 1 task, 0.75h**
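
Since `AGT 03` is the one active task, a minimal sketch of a parse-plus-retry loop may help frame it. Nothing here is taken from the repo: `parse_action`, `call_with_retry`, `ScientistParseError`, and the `action_type` field are hypothetical names, and the real parser from `MOD 09` validates against the frozen `ScientistAction` contract instead of a single key.

```python
import json
from typing import Callable


class ScientistParseError(ValueError):
    """Raised when no attempt yields a valid action (explicit failure)."""


def parse_action(text: str) -> dict:
    """Parse model text into an action dict; raise ValueError if malformed."""
    action = json.loads(text)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(action, dict) or "action_type" not in action:
        raise ValueError("missing required field: action_type")
    return action


def call_with_retry(
    generate: Callable[[str], str], prompt: str, max_retries: int = 1
) -> dict:
    """Call the model, retrying with an error hint when the output is malformed."""
    last_error = None
    for _ in range(max_retries + 1):
        raw = generate(prompt)
        try:
            return parse_action(raw)
        except ValueError as exc:
            last_error = exc
            # Feed the parse error back so the next attempt can correct itself.
            prompt = (
                f"{prompt}\n\nYour previous reply was invalid ({exc}). "
                "Reply with one JSON object only."
            )
    raise ScientistParseError(
        f"no valid action after {max_retries + 1} attempts"
    ) from last_error
```

Bounding the retries and raising a dedicated error keeps the rollout loop from hanging on a model that never produces valid JSON, which matches the "controlled retry or explicit failure" acceptance criterion.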
33
+
34
+ ---
35
 
36
+ ## 3. Internal Ayush Chain After AGT 03
37
 
38
+ These are blocked only by earlier Ayush-owned work.
39
 
40
+ | ID | Task | Depends On | Blocked By | Est |
41
+ |----|------|-----------|-----------|-----|
42
+ | AGT 08 | Prompt formatting and parse tests | AGT 01 to AGT 04 | Person B: AGT 03 | 0.75h |
43
 
44
+ **Total: 1 task, 0.75h**
 
 
 
45
 
46
  ---
47
 
48
+ ## 4. Still Blocked by Kian (Person A) or Mixed A+B Work
49
+
50
+ | ID | Task | Depends On | Remaining External Deliverable | Est |
51
+ |----|------|-----------|-------------------------------|-----|
52
+ | JDG 10 | Expose component metrics for training plots | JDG 05, JDG 07 | `JDG 05` from Kian and `JDG 07` from Max | 0.5h |
53
+
54
+ **Total: 1 task, 0.5h**
55
+
56
+ ### What to ask Kian for first
57
 
58
+ 1. `JDG 05` and `JDG 06` -- unlock `JDG 10` and later `AGT 10`
59
+ 2. `SCN 13` -- deepens the booking-conflict layer for the Lab Manager path
60
+ 3. `ENV 01` -- makes the real environment path available beyond the stub server
61
+
62
+ ---
63
+
64
+ ## 5. Mixed Chain After AGT 05 and Judge Work
65
+
66
+ These depend on both Ayush-owned work and remaining upstream work.
67
 
68
  | ID | Task | Depends On | Blocked By | Est |
69
  |----|------|-----------|-----------|-----|
70
+ | AGT 10 | Write domain-neutral prompt text files for all 3 roles | AGT 01, AGT 07, JDG 06 | Person A: JDG 06 | 0.75h |
71
 
72
+ **Total: 1 task, 0.75h**
73
 
74
  ---
75
 
76
+ ## 6. Blocked by Max (Person C)
77
 
78
+ These tasks cannot proceed until Max delivers the server and deployment pieces.
79
 
80
+ | ID | Task | Depends On | Max Deliverable | Est |
81
+ |----|------|-----------|----------------|-----|
82
  | TRN 01 | Notebook skeleton | API 10 | Deployed HF Space | 0.5h |
83
+ | TRN 03 | Env client wrapper in notebook | API 06 | WebSocket handler against the real env | 1h |
84
+ | TRN 13 | `client.py` reusable module | API 06 | WebSocket handler against the real env | 1h |
85
 
86
  **Total: 3 tasks, 2.5h**
87
 
88
+ ### What to ask Max for first
89
 
90
+ 1. `API 06` -- unblocks `TRN 03` and `TRN 13`
91
+ 2. `API 10` -- unblocks `TRN 01`
92
 
93
  ---
94
 
95
+ ## 7. Deep Training Chain
96
 
97
+ These execute in strict order once Kian's (Person A), Max's (Person C), and earlier Ayush tasks
98
  are done.
99
 
100
  | Order | ID | Task | Depends On | Est |
101
  |-------|----|------|-----------|-----|
102
+ | 1 | TRN 02 | Package install and model setup cell | TRN 01 | 0.75h |
103
+ | 2 | TRN 14 | Select and document base model (notebook side) | TRN 01 | 0.5h |
104
+ | 3 | TRN 04 | Rollout collection loop | TRN 03, AGT 01 | 1h |
105
+ | 4 | TRN 05 | Connect rollouts to GRPO trainer | TRN 04 | 1.25h |
106
+ | 5 | TRN 06 | Log episode metrics | JDG 10, TRN 04 | 0.75h |
107
+ | 6 | TRN 07 | Plot reward curves | TRN 06 | 0.5h |
108
+ | 7 | TRN 08 | Before vs after eval on fixed seeds | SCN 11, TRN 05 | 1h |
109
+ | 8 | TRN 09 | Policy loading for trained checkpoint | TRN 05 | 0.5h |
110
+ | 9 | TRN 10 | Export plots to outputs/plots | TRN 07 | 0.25h |
111
+ | 10 | TRN 15 | Agreement and invalid action rate aggregation | TRN 06, TRN 08, OBS 09 | 0.5h |
112
+ | 11 | OBS 06 | Log training run metadata | TRN 06 | 0.5h |
113
 
114
  **Total: 11 tasks, 7.5h**
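
The ordering in the training-chain table can be checked mechanically against its "Depends On" column. The sketch below copies those dependencies into a dictionary and uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) to produce one valid execution order; the `DEPENDS_ON` name is only for illustration.

```python
from graphlib import TopologicalSorter

# task -> prerequisites, copied from the "Depends On" column of the table
DEPENDS_ON = {
    "TRN 02": {"TRN 01"},
    "TRN 14": {"TRN 01"},
    "TRN 04": {"TRN 03", "AGT 01"},
    "TRN 05": {"TRN 04"},
    "TRN 06": {"JDG 10", "TRN 04"},
    "TRN 07": {"TRN 06"},
    "TRN 08": {"SCN 11", "TRN 05"},
    "TRN 09": {"TRN 05"},
    "TRN 10": {"TRN 07"},
    "TRN 15": {"TRN 06", "TRN 08", "OBS 09"},
    "OBS 06": {"TRN 06"},
}

# static_order() raises graphlib.CycleError if the dependencies are circular.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
```

The result also includes upstream prerequisites such as `TRN 01` and `JDG 10`, so it doubles as a reminder of which external deliverables must land before the chain can start.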
115
 
116
  ---
117
 
118
+ ## 8. Blocked by Kush (Person D)
119
 
120
+ | ID | Task | Depends On | Kush Deliverable | Est |
121
+ |----|------|-----------|-----------------|-----|
122
+ | TST 09 | Notebook smoke test for fresh runtime | TRN 12 | Evaluation storytelling and final notebook flow | 0.5h |
123
 
124
  **Total: 1 task, 0.5h**
125
 
126
  ---
127
 
128
+ ## 9. Recommended Execution Order
 
 
129
 
130
+ ### Phase 1: Completed
131
 
132
+ 1. `FND 08`
133
+ 2. `FND 09`
134
+ 3. `MOD 09`
135
+ 4. `SCN 11`
136
+ 5. `AGT 01`
137
 
138
+ ### Phase 2: Active now
139
 
140
+ 6. `AGT 03`
 
 
141
 
142
+ ### Phase 3: After AGT 03
143
 
144
+ 7. `AGT 08`
 
 
 
145
 
146
+ ### Phase 4: After judge work
147
 
148
+ 8. `AGT 10`
149
+ 9. `JDG 10`
 
 
 
150
 
151
+ ### Phase 5: After Max lands `API 06` and `API 10`
152
 
153
+ 10. `TRN 13`
154
+ 11. `TRN 01`
155
+ 12. `TRN 02`
156
+ 13. `TRN 03`
157
+ 14. `TRN 14`
158
 
159
+ ### Phase 6: Training pipeline
160
 
161
+ 15. `TRN 04`
162
+ 16. `TRN 05`
163
+ 17. `TRN 06`
164
+ 18. `TRN 07`
165
+ 19. `TRN 08`
166
+ 20. `TRN 09`
167
+ 21. `TRN 10`
168
+ 22. `TRN 15`
169
+ 23. `OBS 06`
170
 
171
+ ### Phase 7: Final notebook validation
172
 
173
+ 24. `TST 09`
174
 
175
  ---
176
 
177
+ ## 10. Summary Table
178
 
179
  | Category | Count | Hours |
180
  |----------|-------|-------|
181
  | Active now | 1 | 0.75h |
182
+ | Internal Ayush chain after AGT 03 | 1 | 0.75h |
183
+ | Blocked by Kian or mixed A+B work | 1 | 0.5h |
184
+ | Mixed chain after AGT 05 and judge work | 1 | 0.75h |
185
+ | Blocked by Max | 3 | 2.5h |
186
+ | Deep training chain | 11 | 7.5h |
187
+ | Blocked by Kush | 1 | 0.5h |
188
+ | **Total remaining** | **19** | **13.25h** |
189
 
190
  ---
191
 
192
+ ## 11. Base Model Assumptions
193
 
194
  ### Trainable Scientist policy
195
 
 
214
 
215
  ### Hybrid Lab Manager
216
 
217
+ The MVP Lab Manager path is hybrid:
218
+
219
  - A deterministic feasibility checker remains the source of truth for
220
  `feasible`, constraint flags, and any final structured `LabManagerAction`.
221
  - Model-backed response generation is used for negotiation language and
 
223
  - The reward formula does not change. The deterministic rubric scores the final
224
  plan against the hidden reference spec regardless of how the Lab Manager
225
  generates its language.
226
+ - Reward does not split into separate Scientist versus Lab Manager objectives.
227
  Both roles share the same cooperative reward signal.
228
  - If the team later shares one base model across both roles, the pragmatic
229
+ default is one base model (`Qwen3-4B`) with separate role-specific adapters.
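
The hybrid split described in the bullets above can be sketched as a response composer in which the checker decides the structured fields and a pluggable function supplies only the wording. The commit names a `compose_lab_manager_response` function, but its real signature is not shown in this diff; the arguments, the returned keys, and `template_phrase` below are assumptions.

```python
from typing import Callable


def compose_lab_manager_response(
    checks: dict[str, bool],
    phrase: Callable[[bool, list[str]], str],
) -> dict:
    """Combine deterministic checker results with generated wording."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    feasible = not failed
    return {
        "feasible": feasible,                 # decided by the checker only
        "failed_dimensions": failed,          # decided by the checker only
        "message": phrase(feasible, failed),  # language layer, free to vary
    }


def template_phrase(feasible: bool, failed: list[str]) -> str:
    """Deterministic stand-in for a model-backed phrasing function."""
    if feasible:
        return "The plan fits all current constraints."
    return "The plan exceeds limits on: " + ", ".join(failed) + "."


response = compose_lab_manager_response(
    {"budget": False, "time": True, "compute": False}, template_phrase
)
print(response["feasible"], response["message"])
```

Swapping `template_phrase` for a model call changes only `message`, so the fields that feed the deterministic reward stay identical however the language is generated.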
230
 
231
  ### Prompt assembly
232
 
233
  Ayush-owned prompts should be assembled from normalized scenario data:
234
+
235
  - `task_summary`
236
  - `success_criteria`
237
  - `constraints`
 
244
 
245
  ---
246
 
247
+ ## 12. Files Person B Owns
248
 
249
  | File | Purpose |
250
  |------|---------|
docs/ayush/task_list.md CHANGED
@@ -6,41 +6,47 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
6
 
7
  ## Current status
8
 
9
- - `FND 04` is complete in `replicalab/models.py`
10
  - `FND 08` is complete in `docs/fnd08_frozen_json_contract.md`
11
- - `FND 09` is complete in `openenv.yaml`
12
- - `MOD 01` is now complete, so `MOD 09` is the next Ayush-owned task ready to execute
13
- - `MOD 03` is now complete, so the observation side of `AGT 02` is ready once `AGT 01` exists
14
- - Prompt and parser work should now be built from normalized scenario data, not hard-coded per domain
15
- - The Lab Manager lane is now hybrid: deterministic feasibility truth plus model-backed response synthesis
16
- - After `MOD 09`, the next blocked Ayush-side step remains `SCN 11` once `SCN 09` lands
17
 
18
  ---
19
 
20
  ## Epic E02. Domain Models
21
 
22
- - [ ] **MOD 09** | Add output parser that maps model text to `ScientistAction` | 0.75h | Depends: MOD 01 | Status: unblocked and ready now
23
 
24
  ---
25
 
26
  ## Epic E03. Scenario Engine
27
 
28
- - [ ] **SCN 11** | Create hand checked golden scenarios for prompt testing | 0.75h | Depends: SCN 09
29
 
30
  ---
31
 
32
  ## Epic E04. Scientist Agent and Lab Manager Policy
33
 
34
- - [ ] **AGT 01** | Draft domain-neutral system prompt for Scientist role from normalized scenario data | 0.75h | Depends: MOD 01, SCN 11
35
- - [ ] **AGT 02** | Build observation to prompt formatting helper from normalized scenario-derived observations | 0.75h | Depends: AGT 01, MOD 03
36
- - [ ] **AGT 03** | Add parse plus retry strategy for malformed model output | 0.75h | Depends: MOD 09, AGT 02
37
- - [ ] **AGT 04** | Build baseline heuristic Scientist for non trained smoke tests | 1h | Depends: AGT 02
38
- - [ ] **AGT 05** | Implement deterministic feasibility checker over normalized constraints and resources (shared with Person A) | 1.25h | Depends: SCN 07, MOD 05
39
- - [ ] **AGT 06** | Implement alternative suggestion logic from allowed substitutions and tradeoffs | 1h | Depends: AGT 05, SCN 08
40
- - [ ] **AGT 07** | Add model-backed Lab Manager response synthesis from checker output | 0.75h | Depends: AGT 05
 
41
  - [ ] **AGT 08** | Add prompt formatting and parse tests | 0.75h | Depends: AGT 01 to AGT 04
42
  - [ ] **AGT 10** | Write domain-neutral prompt text files for all three roles | 0.75h | Depends: AGT 01, AGT 07, JDG 06
43
- - [ ] **AGT 11** | Select and document base model for Scientist training | 0.5h | Depends: AGT 01
44
 
45
  ---
46
 
@@ -82,7 +88,7 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
82
 
83
  ## Shared Tasks
84
 
85
- - [x] **FND 08** | Freeze JSON contract for actions and observations (with Person A) | 0.75h | Depends: FND 04 (done) | Status: completed and signed off
86
 
87
  ---
88
 
@@ -91,4 +97,6 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
91
  | Metric | Value |
92
  |--------|-------|
93
  | Total tasks | 29 |
 
 
94
  | Total estimated hours | 21.5h |
 
6
 
7
  ## Current status
8
 
 
9
  - `FND 08` is complete in `docs/fnd08_frozen_json_contract.md`
10
+ - `MOD 09` is complete in `replicalab/agents/scientist_policy.py`
11
+ - `SCN 11` is complete in `tests/fixtures/golden_scenarios.json`
12
+ - `AGT 01` is complete in `replicalab/agents/scientist_policy.py`
13
+ - `AGT 02` is complete in `replicalab/agents/scientist_policy.py`
14
+ - `AGT 04` is complete in `replicalab/agents/scientist_policy.py`
15
+ - `AGT 05` is complete in `replicalab/agents/lab_manager_policy.py`
16
+ - `AGT 06` is complete in `replicalab/agents/lab_manager_policy.py`
17
+ - `AGT 07` is complete in `replicalab/agents/lab_manager_policy.py`
18
+ - `AGT 11` is complete in `docs/agt11_scientist_model_selection.md`
19
+ - The scenario prerequisite bundle (`SCN 01` to `SCN 10`) is now present in the repo, so Ayush prompt work is backed by real normalized scenario packs instead of placeholders
20
+ - The next fully unblocked Ayush task is `AGT 03`
21
+ - `AGT 03` is now the highest-leverage next step because the formatter and parser are both in place, so the retry loop can complete the Scientist action path end-to-end
22
+ - `AGT 10` now waits only on `JDG 06`
23
 
24
  ---
25
 
26
  ## Epic E02. Domain Models
27
 
28
+ - [x] **MOD 09** | Add output parser that maps model text to `ScientistAction` | 0.75h | Depends: MOD 01 | Status: completed on 2026-03-08
29
 
30
  ---
31
 
32
  ## Epic E03. Scenario Engine
33
 
34
+ - [x] **SCN 11** | Create hand checked golden scenarios for prompt testing | 0.75h | Depends: SCN 09 | Status: completed on 2026-03-08
35
 
36
  ---
37
 
38
  ## Epic E04. Scientist Agent and Lab Manager Policy
39
 
40
+ - [x] **AGT 01** | Draft domain-neutral system prompt for Scientist role from normalized scenario data | 0.75h | Depends: MOD 01, SCN 11 | Status: completed on 2026-03-08
41
+ - [x] **AGT 02** | Build observation to prompt formatting helper from normalized scenario-derived observations | 0.75h | Depends: AGT 01, MOD 03 | Status: completed on 2026-03-08
42
+ - [x] **AGT 04** | Build baseline heuristic Scientist for non trained smoke tests | 1h | Depends: AGT 02 | Status: completed on 2026-03-08
43
+ - [x] **AGT 05** | Implement deterministic feasibility checker over normalized constraints and resources (shared with Person A) | 1.25h | Depends: SCN 07, MOD 05 | Status: completed on 2026-03-08
44
+ - [x] **AGT 06** | Implement alternative suggestion logic from allowed substitutions and tradeoffs | 1h | Depends: AGT 05, SCN 08 | Status: completed on 2026-03-08
45
+ - [x] **AGT 07** | Add model-backed Lab Manager response synthesis from checker output | 0.75h | Depends: AGT 05 | Status: completed on 2026-03-08
46
+ - [x] **AGT 11** | Select and document base model for Scientist training | 0.5h | Depends: AGT 01 | Status: completed on 2026-03-08
47
+ - [ ] **AGT 03** | Add parse plus retry strategy for malformed model output | 0.75h | Depends: MOD 09, AGT 02 | Status: ready now
48
  - [ ] **AGT 08** | Add prompt formatting and parse tests | 0.75h | Depends: AGT 01 to AGT 04
49
  - [ ] **AGT 10** | Write domain-neutral prompt text files for all three roles | 0.75h | Depends: AGT 01, AGT 07, JDG 06
 
50
 
51
  ---
52
 
 
88
 
89
  ## Shared Tasks
90
 
91
+ - [x] **FND 08** | Freeze JSON contract for actions and observations (with Person A) | 0.75h | Depends: FND 04 | Status: completed and signed off
92
 
93
  ---
94
 
 
97
  | Metric | Value |
98
  |--------|-------|
99
  | Total tasks | 29 |
100
+ | Completed | 10 |
101
+ | Remaining | 19 |
102
  | Total estimated hours | 21.5h |
docs/changes.md CHANGED
@@ -23,5 +23,11 @@ Rules:
23
  | 2026-03-08 | Person B (Ayush) | FND 08 and FND 09 | Recorded Kian-side sign-off for the shared contract and executed `FND 09` even though it was assigned to Person A | The same contributor is currently covering both the Kian and Ayush lanes, and the OpenEnv registration layer needed to be real rather than left as a placeholder | `FND 08` is now complete, `openenv.yaml` exists, and the repo now carries the minimal OpenEnv runtime wiring needed for local validation | The real environment class in `replicalab/env/replicalab_env.py` is still a later task |
24
  | 2026-03-08 | Person B (Ayush) | MOD 01 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and the strict `ScientistAction` validator was the highest-leverage unblocker for downstream parser and validation work | `ScientistAction` now enforces the frozen contract, `MOD 09` and `MOD 05` are unblocked, and focused schema tests now exist in `tests/test_models.py` | `MOD 03` is the next schema-critical Kian task |
25
  | 2026-03-08 | Person B (Ayush) | MOD 02 and MOD 03 | Executed the tasks even though they were assigned to Person A | The Kian and Ayush lanes are being covered together, and the strict Lab Manager plus typed observation contracts were the fastest way to stabilize the shared schema surface before parser, state, and environment work fan out | `LabManagerAction`, `ConversationEntry`, `Protocol`, and both observation branches now enforce the frozen contract, `MOD 04` and `MOD 11` are unblocked, and the stub server path is verified against the typed models | `MOD 12`, `SCN 01`, and `MOD 05` are the next Kian-lane tasks |
26
+ | 2026-03-08 | Person B (Ayush) | MOD 12 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and centralizing shared defaults was the cleanest way to stop config drift before the real environment and scoring modules expand | `replicalab/config.py` now holds shared defaults for scenario selection, difficulty, round cap, budget cap, timeout values, stub reward, and API host or port defaults, and the server plus scenario builders import them instead of repeating literals | `MOD 05`, `MOD 04`, and `MOD 11` remain the next Kian-lane foundation tasks |
27
+ | 2026-03-08 | Person B (Ayush) | MOD 11 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and a typed step-result contract was needed before the environment, API, replay, and training paths grew around loose metadata | `RewardBreakdown`, `StepInfo`, and typed `StepResult.info` now exist, and the stub runtime explicitly constructs those reserved-key payloads while preserving debug metadata | `MOD 04` and `MOD 05` were the remaining Kian-lane foundation tasks after this |
28
+ | 2026-03-08 | Person B (Ayush) | MOD 04 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and state plus replay needed to use the same typed protocol and conversation models already enforced at the action and observation layers | `EpisodeState` and `EpisodeLog` now carry typed `Protocol`, `ConversationEntry`, and `RewardBreakdown` fields, the stub runtime constructs those nested models explicitly, and replay serialization is now aligned with the typed contract | `MOD 07` and `ENV 01` are now unblocked |
29
+ | 2026-03-08 | Person B (Ayush) | MOD 05 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and structural schema validation was not enough to stop impossible or hallucinated plans from reaching the environment | `replicalab/utils/validation.py` now provides deterministic protocol validation against normalized scenario resources, substitutions, time limits, and required elements, returning structured issues instead of relying on ad hoc runtime checks | `MOD 06` and shared `AGT 05` are now unblocked |
30
+ | 2026-03-08 | Person B (Ayush) | SCN 01 to SCN 10 | Executed the full scenario-engine prerequisite bundle even though it was assigned to Person A and originally sequenced after `MOD 04` | `SCN 11` and `AGT 01` needed a real normalized scenario generator rather than another placeholder, and the Kian plus Ayush lanes are being covered together | The repo now has deterministic seeded scenario generation for mathematics, machine learning, and finance-trading planning, plus golden fixtures and seeded scenario tests; `SCN 11`, `AGT 01`, and the stub server scenario list are now backed by the same normalized scenario pack | `MOD 04` still needs to thread the normalized scenario pack through `EpisodeState` and replay models cleanly |
31
  | 2026-03-08 | Person B (Ayush) | Architecture roadmap | Shifted the planning docs from lab-first replication toward a normalized multi-domain scenario layer with mathematics and machine learning first, finance and trading planning third, and physics or biology later | The team wants the environment to stay domain-agnostic under a stable outer contract while keeping the reward deterministic and making the Lab Manager stronger for the hackathon story | The source-of-truth backlog, README, and Kian or Ayush planning docs now assume `scenario adapter -> normalized scenario pack -> observation mapper -> stable contracts`, plus a hybrid Lab Manager with deterministic feasibility grounding | `SCN 02`, `SCN 07`, `SCN 08`, `AGT 01`, `AGT 05`, `AGT 07`, and the judge wording must now be implemented to this architecture |
32
+ | 2026-03-08 | Person B (Ayush) | FND 03 and FND 12 | Imported the frontend shell and Vite proxy config from Kush's branch even though both tasks are assigned to Max | The `ayush` integration branch only had the frontend scaffold, and the validated frontend from `origin/Kush` needed to exist on the integration branch for future UI and deployment work | `frontend/` now contains the full React plus Vite app, `frontend/vite.config.ts` is present with API and WebSocket proxy rules, and local validation passed with `npm --prefix frontend install` plus `npm --prefix frontend run build` | `FND 13` and `UI 01` are now unblocked; remaining UI tasks still need explicit review before being marked complete |
33
 
docs/completion.md CHANGED
@@ -20,27 +20,30 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
20
  | Metric | Value |
21
  |--------|-------|
22
  | Total tasks | 152 |
23
- | Completed | 13 |
24
  | Partial / active | 10 |
25
- | Remaining | 139 |
26
- | **Completion rate** | **8.55%** |
27
 
28
  ### Completion by Person
29
 
30
  | Person | Assigned | Completed (own) | Completed (by others) | Remaining | Rate |
31
  |--------|----------|----------------|----------------------|-----------|------|
32
- | Kian (Person A) | 49 (47 solo + 2 shared with B) | 1 shared sign-off (`FND 08`) | 5 (`FND 04`, `FND 09`, `MOD 01`, `MOD 02`, `MOD 03` done by Person B) | 43 | 12.24% |
33
- | Person B (Ayush) | 29 (27 solo + 2 shared with A) | 1 shared task (`FND 08`) | 0 | 28 | 3.45% |
34
- | Max (Person C) | 41 | 1 (`FND 11`) | 5 (`FND 01`, `FND 02`, `FND 05`, `FND 07`, `FND 10` done by Person B) | 35 | 14.63% |
35
  | Kush (Person D) | 32 | 0 | 1 (`FND 06` done by Person B) | 31 | 3.13% |
36
- | All (shared) | 3 | 1 (`FND 08`) | 0 | 2 | 33.33% |
37
-
38
- Note: Person B (Ayush) has completed one shared task in their own lane
39
- (`FND 08`) and has also executed eleven tasks outside their assigned ownership
40
- (`FND 01`, `FND 02`, `FND 04`, `FND 05`, `FND 06`, `FND 07`, `FND 09`,
41
- `FND 10`, `MOD 01`, `MOD 02`, `MOD 03`) to keep the Kian, Max, and Kush
42
- dependency chain moving. The Ayush lane still has one direct implementation
43
- task ready now: `MOD 09`.
 
 
 
44
 
45
  ---
46
 
@@ -49,10 +52,10 @@ task ready now: `MOD 09`.
49
  | ID | Assigned To | Current Status | Remaining Acceptance Item |
50
  |----|-------------|----------------|---------------------------|
51
  | API 01 | Max (Person C) | FastAPI app shell and `/health` endpoint work locally against the stub env | Real env dependency and task-owner sign-off |
52
- | API 02 | Max (Person C) | `/reset` works locally against the stub env | Real env reset dependency and task-owner sign-off |
53
  | API 03 | Max (Person C) | `/step` works locally against the stub env | Real env step dependency and task-owner sign-off |
54
- | API 04 | Max (Person C) | `/scenarios` exists with a stub-backed scenario list | Real scenario source and task-owner sign-off |
55
- | API 06 | Max (Person C) | WebSocket reset, ping, and step work locally against the stub env | Real env integration and task-owner sign-off |
56
  | API 07 | Max (Person C) | Idle timeout and cleanup logic exist in the WebSocket path | Real env disconnect cleanup verification |
57
  | API 08 | Max (Person C) | `server/Dockerfile` exists | Local Docker build and run verification |
58
  | API 13 | Max (Person C) | CORS middleware exists for dev and hosted origins | Frontend integration verification |
@@ -78,6 +81,41 @@ task ready now: `MOD 09`.
78
  | MOD 01 | E02 | Person A | Implement `ScientistAction` schema | `replicalab/models.py`, `tests/test_models.py`, `server/app.py` | 2026-03-08 | Replaced the `ScientistAction` stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, rejected mixed-mode payloads, added conditional validation for proposal, revision, request-info, and accept modes, and patched the stub server so `accept` preserves the current protocol. | Valid scientist actions parse and invalid fields raise validation errors | Yes - verified with `python -m pytest tests/test_models.py` and a stub-env `ScientistAction.model_validate(...)` smoke step |
79
  | MOD 02 | E02 | Person A | Implement `LabManagerAction` schema | `replicalab/models.py`, `tests/test_models.py` | 2026-03-08 | Replaced the `LabManagerAction` stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, enforced feasible-flag consistency, rejected suggestion fields outside `suggest_alternative`, and added focused validation tests. | Valid lab manager actions parse and invalid fields raise validation errors | Yes - verified with `python -m pytest tests/test_models.py` |
80
  | MOD 03 | E02 | Person A | Implement role specific observation models | `replicalab/models.py`, `tests/test_models.py`, `server/app.py` | 2026-03-08 | Added typed `ConversationEntry` and `Protocol` models, upgraded both observation branches to use typed nested structures with non-negative numeric constraints and stable keys, and verified dict-to-model coercion through the stub server. | Scientist and lab observations serialize to JSON with stable keys | Yes - verified with `python -m pytest tests/test_models.py` and a stub `reset()` / `step()` JSON smoke test |
81
 
82
  ### Shared Tasks - Completed
83
 
@@ -105,6 +143,7 @@ task ready now: `MOD 09`.
105
  |---------------|-------------------|
106
  | FND 01 | FND 02, FND 03, FND 04, FND 05, FND 06, FND 07, FND 10 |
107
  | FND 02 | FND 11 |
 
108
  | FND 04 | FND 08, FND 09 |
109
  | FND 05 | No downstream dependencies |
110
  | FND 06 | DOC 01 |
@@ -113,21 +152,48 @@ task ready now: `MOD 09`.
113
  | FND 09 | OpenEnv registration layer is now present for later `/web` and deployment work |
114
  | FND 10 | No downstream dependencies |
115
  | FND 11 | No new formal dependencies, but server scaffold work can now install from a standalone requirements file |
 
116
  | MOD 01 | MOD 05, MOD 09 |
117
  | MOD 02 | No new formal dependencies, but the Lab Manager contract is now stable for later policy work |
118
  | MOD 03 | MOD 04, MOD 11 |
119
 
120
  ### Current Unblocked and Active Tasks
121
 
122
  | ID | Owner | Task | Unblocked By |
123
  |----|-------|------|-------------|
124
- | FND 03 | Max (Person C) | Initialize React plus Vite frontend shell | FND 01 |
125
- | MOD 04 | Kian (Person A) | Implement `EpisodeState` and `EpisodeLog` models | MOD 03 |
126
- | MOD 05 | Kian (Person A) | Add protocol validation for sample size, controls, duration, equipment vocab, and reagent vocab | MOD 01 |
127
- | MOD 09 | Person B (Ayush) | Add output parser that maps model text to `ScientistAction` | MOD 01 |
128
- | MOD 11 | Kian (Person A) | Implement `StepResult` model | MOD 03 |
129
- | MOD 12 | Kian (Person A) | Create shared environment configuration module | FND 08 |
130
- | SCN 01 | Kian (Person A) | Implement deterministic RNG helper | FND 08 |
 
 
 
131
  | DOC 01 | Kush (Person D) | Write hook, problem statement, and one line product summary | FND 06 |
132
 
133
  ---
@@ -136,10 +202,10 @@ task ready now: `MOD 09`.
136
 
137
  | Epic | Total Tasks | Completed | Rate |
138
  |------|------------|-----------|------|
139
- | E01. Foundations and repository setup | 13 | 10 | 76.92% |
140
- | E02. Domain models, validation, state contracts | 12 | 3 | 25.00% |
141
- | E03. Scenario engine and constraint generation | 13 | 0 | 0% |
142
- | E04. Scientist agent and Lab Manager policy | 11 | 0 | 0% |
143
  | E05. Judge engine and reward logic | 11 | 0 | 0% |
144
  | E06. OpenEnv environment implementation | 11 | 0 | 0% |
145
  | E07. API, server, Docker, deployment | 19 | 0 | 0% |
 
20
  | Metric | Value |
21
  |--------|-------|
22
  | Total tasks | 152 |
23
+ | Completed | 38 |
24
  | Partial / active | 10 |
25
+ | Remaining | 104 |
26
+ | **Completion rate** | **25.00%** |
27
 
28
  ### Completion by Person
29
 
30
  | Person | Assigned | Completed (own) | Completed (by others) | Remaining | Rate |
31
  |--------|----------|----------------|----------------------|-----------|------|
32
+ | Kian (Person A) | 49 (47 solo + 2 shared with B) | 1 shared sign-off (`FND 08`) | 20 (`FND 04`, `FND 09`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`, `MOD 11`, `MOD 12`, `SCN 01` to `SCN 10`, `AGT 05` done by Person B) | 28 | 42.86% |
33
+ | Person B (Ayush) | 29 (27 solo + 2 shared with A) | 10 (`FND 08`, `MOD 09`, `SCN 11`, `AGT 01`, `AGT 02`, `AGT 04`, `AGT 05`, `AGT 06`, `AGT 07`, `AGT 11`) | 0 | 19 | 34.48% |
34
+ | Max (Person C) | 41 | 1 (`FND 11`) | 7 (`FND 01`, `FND 02`, `FND 03`, `FND 05`, `FND 07`, `FND 10`, `FND 12` done by others) | 33 | 19.51% |
35
  | Kush (Person D) | 32 | 0 | 1 (`FND 06` done by Person B) | 31 | 3.13% |
36
+ | All (shared) | 3 | 2 (`FND 08`, `AGT 05`) | 0 | 1 | 66.67% |
37
+
38
+ Note: Person B (Ayush) has completed two shared tasks in their own lane
39
+ (`FND 08`, `AGT 05`) plus eight solo tasks (`MOD 09`,
40
+ `SCN 11`, `AGT 01`, `AGT 02`, `AGT 04`, `AGT 06`, `AGT 07`, `AGT 11`), and has also executed twenty-five tasks outside their assigned
41
+ ownership (`FND 01`, `FND 02`, `FND 04`, `FND 05`, `FND 06`, `FND 07`,
42
+ `FND 09`, `FND 10`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`,
43
+ `MOD 11`, `MOD 12`, `SCN 01` to `SCN 10`) to keep the Kian, Max, and Kush
44
+ dependency chain moving. Ayush now has one fully unblocked implementation
45
+ task available: `AGT 03`, with `AGT 10` reduced to a single remaining
46
+ external dependency on `JDG 06`.
47
 
48
  ---
49
 
 
52
  | ID | Assigned To | Current Status | Remaining Acceptance Item |
53
  |----|-------------|----------------|---------------------------|
54
  | API 01 | Max (Person C) | FastAPI app shell and `/health` endpoint work locally against the stub env | Real env dependency and task-owner sign-off |
55
+ | API 02 | Max (Person C) | `/reset` works locally against the stub env and now seeds normalized math / ML / finance scenarios through the shared generator | Real env reset dependency and task-owner sign-off |
56
  | API 03 | Max (Person C) | `/step` works locally against the stub env | Real env step dependency and task-owner sign-off |
57
+ | API 04 | Max (Person C) | `/scenarios` returns the normalized scenario-family list from the shared generator | Real env exposure and task-owner sign-off |
58
+ | API 06 | Max (Person C) | WebSocket reset, ping, and step work locally against the stub env, including normalized scenario-family resets | Real env integration and task-owner sign-off |
59
  | API 07 | Max (Person C) | Idle timeout and cleanup logic exist in the WebSocket path | Real env disconnect cleanup verification |
60
  | API 08 | Max (Person C) | `server/Dockerfile` exists | Local Docker build and run verification |
61
  | API 13 | Max (Person C) | CORS middleware exists for dev and hosted origins | Frontend integration verification |
 
81
  | MOD 01 | E02 | Person A | Implement `ScientistAction` schema | `replicalab/models.py`, `tests/test_models.py`, `server/app.py` | 2026-03-08 | Replaced the `ScientistAction` stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, rejected mixed-mode payloads, added conditional validation for proposal, revision, request-info, and accept modes, and patched the stub server so `accept` preserves the current protocol. | Valid scientist actions parse and invalid fields raise validation errors | Yes - verified with `python -m pytest tests/test_models.py` and a stub-env `ScientistAction.model_validate(...)` smoke step |
82
  | MOD 02 | E02 | Person A | Implement `LabManagerAction` schema | `replicalab/models.py`, `tests/test_models.py` | 2026-03-08 | Replaced the `LabManagerAction` stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, enforced feasible-flag consistency, rejected suggestion fields outside `suggest_alternative`, and added focused validation tests. | Valid lab manager actions parse and invalid fields raise validation errors | Yes - verified with `python -m pytest tests/test_models.py` |
83
  | MOD 03 | E02 | Person A | Implement role specific observation models | `replicalab/models.py`, `tests/test_models.py`, `server/app.py` | 2026-03-08 | Added typed `ConversationEntry` and `Protocol` models, upgraded both observation branches to use typed nested structures with non-negative numeric constraints and stable keys, and verified dict-to-model coercion through the stub server. | Scientist and lab observations serialize to JSON with stable keys | Yes - verified with `python -m pytest tests/test_models.py` and a stub `reset()` / `step()` JSON smoke test |
84
+ | MOD 04 | E02 | Person A | Implement `EpisodeState` and `EpisodeLog` models | `replicalab/models.py`, `server/app.py`, `tests/test_models.py` | 2026-03-08 | Replaced the remaining loose `dict` state and replay fields with typed `Protocol`, `ConversationEntry`, and `RewardBreakdown` models, updated the stub runtime to construct those nested models explicitly, and added round-trip coverage for serialized state and logs. | Full state round trip serialize plus deserialize works | Yes - verified with `python -m pytest tests/test_models.py` |
85
+ | MOD 05 | E02 | Person A | Add protocol validation for sample size, controls, duration, equipment vocab, and reagent vocab | `replicalab/utils/validation.py`, `tests/test_models.py`, `tests/test_scenarios.py` | 2026-03-08 | Added deterministic semantic protocol validation with `ValidationResult` and `validate_protocol(...)` checks for resource vocabulary, allowed substitutions, duration limits, required-element coverage, and obvious impossibilities against the normalized scenario pack. | Invalid protocol examples are rejected with readable reasons | Yes - verified with `python -m pytest tests/test_models.py tests/test_scenarios.py` |
86
+ | MOD 11 | E02 | Person A | Implement `StepResult` model | `replicalab/models.py`, `server/app.py`, `tests/test_models.py` | 2026-03-08 | Added typed `RewardBreakdown` and `StepInfo` models, upgraded `StepResult.info` to the reserved-key contract while still allowing debug metadata, and updated the stub runtime to build typed reward and step-info payloads explicitly. | Step result serializes cleanly and all consumers agree on its shape | Yes - verified with `python -m pytest tests/test_models.py` |
87
+ | MOD 12 | E02 | Person A | Create environment configuration module with shared constants | `replicalab/config.py`, `server/app.py`, `replicalab/scenarios/*.py`, `tests/test_config.py` | 2026-03-08 | Added a shared configuration module for default scenario and difficulty, round cap, budget cap, timeout values, stub reward, and API host or port defaults, then updated the server and scenario builders to import those constants instead of repeating literals. | All modules import config from one place and no magic numbers remain in env or scoring code | Yes - verified with `python -m pytest tests/test_config.py tests/test_scenarios.py` |
88
+ | SCN 01 | E03 | Person A | Implement deterministic RNG helper `seed_rng()` | `replicalab/utils/seed.py`, `replicalab/scenarios/templates.py` | 2026-03-08 | Added deterministic seed helpers that derive reproducible RNG namespaces for scenario generation. | Same seed always yields the same random choices and the seed utility is importable from scenarios and env | Yes - verified with `python -m pytest tests/test_scenarios.py` |
89
+ | SCN 02 | E03 | Person A | Define normalized scenario schema with task summary, success criteria, constraints, resources, allowed substitutions, and hidden reference spec | `replicalab/scenarios/templates.py` | 2026-03-08 | Added `NormalizedScenarioPack` plus strict `ScenarioConstraint`, `ScenarioResource`, `AllowedSubstitution`, and `HiddenReferenceSpec` models to standardize all scenario families. | All scenario builders return the same normalized top-level structure and mapper-ready inputs | Yes - verified with `python -m pytest tests/test_scenarios.py` |
90
+ | SCN 03 | E03 | Person A | Implement mathematics template | `replicalab/scenarios/math_reasoning.py` | 2026-03-08 | Added deterministic mathematics planning templates covering theorem, proof-goal, review, and time constraints. | Generated scenario passes structure and internal consistency tests | Yes - verified with `python -m pytest tests/test_scenarios.py` |
91
+ | SCN 04 | E03 | Person A | Implement ML benchmark template | `replicalab/scenarios/ml_benchmark.py` | 2026-03-08 | Added deterministic ML benchmark templates covering dataset, compute, time, and evaluation constraints. | Generated scenario passes structure and internal consistency tests | Yes - verified with `python -m pytest tests/test_scenarios.py` |
92
+ | SCN 05 | E03 | Person A | Implement finance and trading planning template | `replicalab/scenarios/finance_trading.py` | 2026-03-08 | Added deterministic offline finance and trading planning templates covering capital, drawdown, slippage, and backtest rules. | Generated scenario passes structure and internal consistency tests | Yes - verified with `python -m pytest tests/test_scenarios.py` |
93
+ | SCN 06 | E03 | Person A | Implement difficulty application for easy, medium, hard | `replicalab/scenarios/templates.py`, `tests/test_scenarios.py` | 2026-03-08 | Added mechanical difficulty scaling that adjusts budgets, time, staff, resource availability, and injected conflict constraints across easy, medium, and hard. | Difficulty visibly changes the normalized scenario pack in a meaningful way | Yes - verified with `python -m pytest tests/test_scenarios.py` |
94
+ | SCN 07 | E03 | Person A | Implement normalized constraint and resource generator | `replicalab/scenarios/templates.py`, `tests/test_scenarios.py` | 2026-03-08 | Added normalized constraint and resource mapping into role-specific observations with consistency checks for unique keys and non-contradictory generated packs. | No generated scenario contains contradictory constraints or resources | Yes - verified with `python -m pytest tests/test_scenarios.py` |
95
+ | SCN 08 | E03 | Person A | Implement hidden reference spec and allowed substitutions per template | `replicalab/scenarios/templates.py`, `tests/test_scenarios.py` | 2026-03-08 | Added per-template hidden reference specs and allowed substitutions so scoring and negotiation can distinguish fixed versus flexible elements deterministically. | Hidden reference clearly marks what is fixed versus flexible for deterministic scoring | Yes - verified with `python -m pytest tests/test_scenarios.py` |
96
+ | SCN 09 | E03 | Person A | Implement `generate_scenario(seed, template, difficulty)` | `replicalab/scenarios/templates.py`, `server/app.py`, `tests/test_scenarios.py` | 2026-03-08 | Added deterministic full-scenario generation and wired the stub server to use the normalized scenario families instead of the earlier hard-coded lab-only placeholder list. | Function returns a full scenario with deterministic content | Yes - verified with `python -m pytest tests/test_scenarios.py` and a `_StubEnv.reset(...)` smoke test |
97
+ | SCN 10 | E03 | Person A | Add seeded generation tests and consistency tests | `tests/test_scenarios.py` | 2026-03-08 | Added seeded determinism, variation, difficulty, consistency, and family-list tests for the normalized scenario engine. | Same seed plus template returns the same scenario and different seeds vary | Yes - verified with `python -m pytest tests/test_scenarios.py` |
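The `SCN 01` and `SCN 10` rows above describe seed helpers that derive reproducible RNG namespaces for scenario generation. A minimal sketch of that idea follows. The name `seed_rng` comes from the table, but the signature, the `namespace` parameter, the hashing scheme, and the `pick_constraints` helper are illustrative assumptions, not the contents of `replicalab/utils/seed.py`.

```python
import hashlib
import random


def seed_rng(seed: int, namespace: str = "scenario") -> random.Random:
    """Return an isolated RNG derived from (seed, namespace).

    Hypothetical signature; the real helper may differ.
    """
    # Hash the pair so each namespace gets its own stream for the same seed.
    digest = hashlib.sha256(f"{namespace}:{seed}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def pick_constraints(seed: int, pool: list[str], k: int) -> list[str]:
    """Hypothetical consumer: sample k constraint keys deterministically."""
    rng = seed_rng(seed, namespace="constraints")
    return rng.sample(pool, k)
```

Giving each consumer its own namespace means that adding a new random draw in one part of the generator would not shift the draws made elsewhere, which is one plausible way to keep golden fixtures stable.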
98
+
99
+ ### Person B (Ayush) - Completed own tasks
100
+
101
+ | ID | Epic | Task | File/Module | Date | What Was Done | Acceptance Criteria | Verified |
102
+ |----|------|------|-------------|------|---------------|--------------------|---------|
103
+ | MOD 09 | E02 | Add output parser that maps model text to `ScientistAction` | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added a raw-text parser that extracts JSON from plain output, fenced blocks, or prose-wrapped objects, validates it into `ScientistAction`, and raises explicit `ScientistOutputParseError` values for missing JSON, invalid JSON, or schema failures. | Parser returns structured action or explicit parse error | Yes - verified with `python -m pytest tests/test_scientist_policy.py tests/test_models.py` and a direct `parse_scientist_output(...)` smoke check |
104
+ | SCN 11 | E03 | Create hand-checked golden scenarios for prompt testing | `tests/fixtures/golden_scenarios.json`, `tests/test_scenarios.py` | 2026-03-08 | Added three deterministic golden scenarios for math, ML, and finance prompt checks plus fixture-validation tests. | Three fixed scenarios are available for deterministic manual testing | Yes - verified with `python -m pytest tests/test_scenarios.py` |
105
+ | AGT 01 | E04 | Draft domain-neutral system prompt for Scientist role from normalized scenario data | `replicalab/agents/scientist_policy.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `build_scientist_system_prompt(...)` to render role guidance, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON contract from normalized scenario data. | Prompt clearly explains role, mapped constraints, and JSON output contract | Yes - verified with `python -m pytest tests/test_scientist_policy.py` and a direct prompt-build smoke check |
106
+ | AGT 02 | E04 | Build observation to prompt formatting helper from normalized scenario-derived observations | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `format_scientist_observation(...)` to render round status, paper context, conversation history, current protocol, and the next-action instruction in a fixed deterministic order, and exported it through the agent package. | Formatted prompt includes task info, history, and action schema consistently | Yes - verified with `python -m pytest tests/test_scientist_policy.py` |
107
+ | AGT 04 | E04 | Build baseline heuristic Scientist for non-trained smoke tests | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `build_baseline_scientist_action(...)`, a deterministic non-LLM Scientist policy that proposes a protocol on the first turn, revises only when the latest Lab Manager feedback contains an obvious blocker, and otherwise accepts the current protocol so smoke episodes can finish cleanly. | Baseline can complete episodes without crashing | Yes - verified with `python -m pytest tests/test_scientist_policy.py` including a stub-env episode smoke test |
108
+ | AGT 05 | E04 | Implement deterministic feasibility checker over normalized constraints and resources | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added a deterministic Lab Manager feasibility checker with a typed `FeasibilityCheckResult`, explicit per-dimension protocol, budget, equipment, reagents, schedule, staff, and policy checks, substitution reporting, and stable summary output. | Checker returns clear pass or fail per constraint dimension | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py tests/test_validation.py tests/test_scientist_policy.py` |
109
+ | AGT 06 | E04 | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added deterministic alternative-suggestion logic that applies substitutions, duration clamps, and sample-size reductions in fixed order, re-runs feasibility after the revision, and returns a typed `AlternativeSuggestion` with applied changes, remaining failures, and pre or post feasibility checks. | Lab Manager can suggest at least one sensible revision when the initial plan fails | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py` |
110
+ | AGT 07 | E04 | Add grounded Lab Manager response synthesis from feasibility results and suggested revisions | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `server/app.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added `compose_lab_manager_response(...)`, a deterministic outward-action composer that converts feasibility plus alternative-suggestion results into a typed `LabManagerAction` with stable flags, readable explanations, and optional injected explanation rendering, then wired the stub server to log those grounded responses instead of placeholder text. | Output is readable, grounded in checker results, and maps cleanly to underlying checks | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py` and a stub-env step smoke check |
111
+ | AGT 11 | E04 | Select and document base model for Scientist training | `docs/agt11_scientist_model_selection.md`, `README.md` | 2026-03-08 | Recorded `Qwen3-4B` as the primary Scientist training model with `Qwen3-8B` as the H100-only stretch fallback, and surfaced the decision in the README so the training path uses one canonical model choice. | Decision is recorded and all team members know which model will be fine tuned | Yes - verified by the decision record and README update |
112
+
113
+ ### Kush (Person D) - Completed on behalf of others
114
+
115
+ | ID | Epic | Assigned To | Task | File/Module | Date | What Was Done | Acceptance Criteria | Verified |
116
+ |----|------|------------|------|-------------|------|---------------|--------------------|---------|
117
+ | FND 03 | E01 | Max (Person C) | Initialize React plus Vite frontend shell | `frontend/package.json`, `frontend/src/`, `frontend/public/` | 2026-03-08 | Imported the full React plus Vite frontend tree from Kush's branch onto `ayush`, including the app shell, pages, component library, assets, and TypeScript config. | `npm install` and dev server run successfully | Yes - verified with `npm --prefix frontend install` and `npm --prefix frontend run build` |
118
+ | FND 12 | E01 | Max (Person C) | Create Vite config with API and WebSocket proxy support plus stable build output settings | `frontend/vite.config.ts` | 2026-03-08 | Imported Kush's Vite configuration with `@` alias plus `/api` and `/ws` proxy rules, then verified the frontend builds successfully against that config on `ayush`. | Frontend dev server can reach backend without manual URL edits and build output is predictable for Docker packaging | Yes - verified with `npm --prefix frontend run build` |
119
 
120
  ### Shared Tasks - Completed
121
 
 
143
  |---------------|-------------------|
144
  | FND 01 | FND 02, FND 03, FND 04, FND 05, FND 06, FND 07, FND 10 |
145
  | FND 02 | FND 11 |
146
+ | FND 03 | FND 12, FND 13, UI 01 |
147
  | FND 04 | FND 08, FND 09 |
148
  | FND 05 | No downstream dependencies |
149
  | FND 06 | DOC 01 |
 
152
  | FND 09 | OpenEnv registration layer is now present for later `/web` and deployment work |
153
  | FND 10 | No downstream dependencies |
154
  | FND 11 | No new formal dependencies, but server scaffold work can now install from a standalone requirements file |
155
+ | FND 12 | Frontend dev proxying is now configured for local API and WebSocket work |
156
  | MOD 01 | MOD 05, MOD 09 |
157
  | MOD 02 | No new formal dependencies, but the Lab Manager contract is now stable for later policy work |
158
  | MOD 03 | MOD 04, MOD 11 |
159
+ | MOD 04 | MOD 07, ENV 01 |
160
+ | MOD 05 | MOD 06, AGT 05 |
161
+ | MOD 11 | No new formal dependency edge by itself, but `StepResult` metadata is now stable for environment, API, replay, and training consumers |
162
+ | MOD 12 | Shared defaults now come from `replicalab/config.py`, reducing config drift before environment and scoring work expands |
163
+ | SCN 01 | SCN 09 now has a deterministic seed utility to build on |
164
+ | SCN 02 | SCN 03, SCN 04, SCN 05, SCN 07 |
165
+ | SCN 03 | SCN 06, SCN 08 |
166
+ | SCN 04 | SCN 06, SCN 08 |
167
+ | SCN 05 | SCN 06, SCN 08 |
168
+ | SCN 06 | Harder scenario variants and curriculum-ready difficulty scaling now exist |
169
+ | SCN 07 | `AGT 05` is complete; `AGT 06`, `AGT 07`, `JDG 02`, and `SCN 13` are now unblocked from the normalized resource layer |
170
+ | SCN 08 | `AGT 06` is now unblocked; `JDG 01` and `JDG 03` are also unblocked |
171
+ | SCN 09 | SCN 10, SCN 11, ENV 01, ENV 02 |
172
+ | SCN 10 | Scenario determinism and consistency now have regression coverage |
173
+ | SCN 11 | AGT 01, TRN 08 |
174
+ | MOD 09 | Together with completed `AGT 02`, `AGT 03` is now unblocked |
175
+ | AGT 01 | AGT 02, AGT 11, TRN 04 |
176
+ | AGT 02 | AGT 03, AGT 04 |
177
+ | AGT 04 | Removes the last baseline-policy blocker; `AGT 08` now only waits on `AGT 03` |
178
+ | AGT 05 | AGT 06, AGT 07, JDG 02 |
179
+ | AGT 06 | No new formal dependency edge by itself, but `AGT 07` now has deterministic revision content to narrate and compare against |
180
+ | AGT 07 | `AGT 10` now only waits on `JDG 06`, and the stub server now emits grounded Lab Manager responses instead of placeholder review text |
181
+ | AGT 11 | No new formal dependency edge by itself, but the Scientist training model choice is now fixed across repo docs |
182
 
  ### Current Unblocked and Active Tasks

  | ID | Owner | Task | Unblocked By |
  |----|-------|------|-------------|
+ | FND 13 | Kush (Person D) | Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles | FND 03 |
+ | UI 01 | Kush (Person D) | Create application shell with three-panel layout | FND 03 |
+ | AGT 03 | Ayush (Person B) | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 |
+ | MOD 06 | Kian (Person A) | Add semantic validators for impossible plans such as zero sample size with positive controls | MOD 05 |
+ | MOD 07 | Max (Person C) | Add state serialization helper for replay logs | MOD 04 |
+ | JDG 01 | Kian (Person A) | Implement rigor or objective-validity score for plan completeness, required checks, method quality, and justification | SCN 08 |
+ | JDG 02 | Kian (Person A) | Implement feasibility score for budget, resources, time, staffing, compute, and bookings | SCN 07, AGT 05 |
+ | JDG 03 | Kian (Person A) | Implement fidelity score against hidden reference spec, required steps, and allowed substitutions | SCN 08 |
+ | SCN 13 | Kian (Person A) | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time-slot conflicts and duration | SCN 07 |
+ | ENV 01 | Kian (Person A) | Create `ReplicaLabEnv` class skeleton | MOD 04, SCN 09 |
  | DOC 01 | Kush (Person D) | Write hook, problem statement, and one-line product summary | FND 06 |
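
The "Unblocked By" column follows a simple rule: a task becomes active once every listed prerequisite is complete. A minimal sketch of that rule, assuming a hypothetical `unblocked_tasks` helper and a hand-copied subset of the dependency rows (nothing in this commit suggests the repo ships such a helper). The `AGT 10` entry only reflects the note that it still waits on `JDG 06`; its full prerequisite list is not shown here and is an assumption.

```python
from __future__ import annotations


# Hypothetical helper for illustration; not part of the ReplicaLab codebase.
def unblocked_tasks(
    dependencies: dict[str, set[str]],
    completed: set[str],
) -> list[str]:
    """Return open tasks whose prerequisites are all complete, sorted by ID."""
    return sorted(
        task
        for task, prereqs in dependencies.items()
        if task not in completed and prereqs <= completed
    )


# Subset of the tracker rows; the AGT 10 prerequisites are assumed.
DEPENDENCIES = {
    "AGT 03": {"MOD 09", "AGT 02"},
    "JDG 02": {"SCN 07", "AGT 05"},
    "ENV 01": {"MOD 04", "SCN 09"},
    "AGT 10": {"AGT 07", "JDG 06"},
}
COMPLETED = {"MOD 09", "AGT 02", "SCN 07", "AGT 05", "MOD 04", "SCN 09", "AGT 07"}

print(unblocked_tasks(DEPENDENCIES, COMPLETED))
# ['AGT 03', 'ENV 01', 'JDG 02']
```

`AGT 10` stays out of the result until `JDG 06` joins the completed set, which matches how the tracker describes it.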
 
199
  ---
 
  | Epic | Total Tasks | Completed | Rate |
  |------|------------|-----------|------|
+ | E01. Foundations and repository setup | 13 | 12 | 92.31% |
+ | E02. Domain models, validation, state contracts | 12 | 8 | 66.67% |
+ | E03. Scenario engine and constraint generation | 13 | 11 | 84.62% |
+ | E04. Scientist agent and Lab Manager policy | 11 | 7 | 63.64% |
  | E05. Judge engine and reward logic | 11 | 0 | 0% |
  | E06. OpenEnv environment implementation | 11 | 0 | 0% |
  | E07. API, server, Docker, deployment | 19 | 0 | 0% |
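
The Rate column is completed tasks divided by total tasks, shown to two decimal places. A small sketch for rechecking the figures, assuming a hypothetical `completion_rate` helper (the tracker itself appears to be maintained by hand) and mirroring the table's convention of writing a bare `0%` for untouched epics:

```python
def completion_rate(completed: int, total: int) -> str:
    """Format completed/total as a percentage string in the tracker's style."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= completed <= total:
        raise ValueError("completed must be between 0 and total")
    if completed == 0:
        return "0%"  # the table writes untouched epics without decimals
    return f"{100 * completed / total:.2f}%"


for epic, done, total in [("E01", 12, 13), ("E02", 8, 12), ("E03", 11, 13), ("E04", 7, 11)]:
    print(epic, completion_rate(done, total))
```

Running it reproduces the four updated rows (92.31%, 66.67%, 84.62%, 63.64%).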
docs/kian/task_breakdown.md CHANGED
@@ -6,32 +6,28 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
6
 
7
  ## Current status
8
 
9
- - `FND 04` complete
10
- - `FND 08` complete
11
- - `FND 09` complete
12
- - `MOD 01` complete
13
- - `MOD 02` complete
14
- - `MOD 03` complete
15
- - The Kian lane should now move to config, scenario seeding, validation, the remaining state-model pass, and the normalized scenario layer
16
 
17
  ---
18
 
19
  ## Recommended execution order
20
 
21
- 1. `MOD 12` -- creates one shared config module before env and scoring files branch out
22
- 2. `SCN 01` -- starts the deterministic scenario utility chain
23
- 3. `MOD 05` -- converts the frozen action contract into reusable protocol validation
24
- 4. `MOD 04` -- upgrades state and replay models to match the typed observation path
25
- 5. `SCN 02` -- defines the normalized scenario pack below the stable outer contract
26
- 6. `MOD 11` -- finalizes `StepResult` now that the observation wrapper is typed
27
 
28
  ---
29
 
30
  ## Why this order
31
 
32
- - `MOD 12` and `SCN 01` are the cleanest foundational follow-ons and reduce future magic numbers and seed drift.
33
- - `MOD 05` is already unblocked and should land before higher-level environment logic starts trusting protocol payloads.
34
- - `MOD 04` should land before `SCN 02` so the normalized scenario pack can be threaded into `EpisodeState` cleanly rather than retrofitted later.
35
- - `SCN 02` is now the key architecture task because it formalizes the normalized pack that mathematics, machine learning, and finance scenarios all have to emit while keeping the outer contract unchanged.
36
- - `MOD 11` now follows naturally from the typed observation work in `MOD 03`.
37
-
 
6
 
7
  ## Current status
8
 
9
+ - `FND 04`, `FND 08`, `FND 09`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`, `MOD 11`, and `MOD 12` are complete
10
+ - Shared `AGT 05` is now complete, so the deterministic feasibility layer exists for both the Lab Manager path and the later judge feasibility score
11
+ - `SCN 01` to `SCN 10` are also complete, so the deterministic scenario layer now exists in code
12
+ - The Kian lane no longer needs to start with scenario seeding or template scaffolding
13
+ - The remaining high-leverage work is semantic edge-case validation, booking conflicts, judge logic, and the real environment
 
 
14
 
15
  ---
16
 
17
  ## Recommended execution order
18
 
19
+ 1. `MOD 06` -- extend the new semantic validation layer to catch impossible edge cases early
20
+ 2. `SCN 13` -- deepen the normalized scenario layer with booking and scheduling conflicts
21
+ 3. `JDG 01`, `JDG 02`, and `JDG 03` -- start the deterministic reward components that are now unblocked
22
+ 4. `JDG 04` and `JDG 05` -- complete the reward pipeline once the component scorers exist
23
+ 5. `ENV 01` and `ENV 02` -- once typed state and core scoring pieces are in place, start the real OpenEnv environment path
 
24
 
25
  ---
26
 
27
  ## Why this order
28
 
29
+ - `MOD 06` is the smallest remaining contract-hardening task and builds directly on the completed `MOD 05` validator.
30
+ - `SCN 13` is the remaining scenario-layer depth task; it builds naturally on the completed normalized resource model.
31
+ - `JDG 01` and `JDG 03` can start immediately because their only formal prerequisite, `SCN 08`, is already complete.
32
+ - `JDG 02` is now also unblocked because the deterministic feasibility checker from `AGT 05` exists.
33
+ - The environment path can now start from typed state and step-result contracts instead of loose dict-based placeholders.
 
docs/kian/task_list.md CHANGED
@@ -6,29 +6,28 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
6
 
7
  ## Current status
8
 
9
- - `FND 04` is complete in `replicalab/models.py`
10
- - `FND 08` is complete in `docs/fnd08_frozen_json_contract.md`
11
- - `FND 09` is complete in `openenv.yaml`
12
- - `MOD 01` is now complete in `replicalab/models.py`
13
- - `MOD 02` is now complete in `replicalab/models.py`
14
- - `MOD 03` is now complete in `replicalab/models.py`
15
- - The next Kian-lane tasks are `MOD 12`, `SCN 01`, `MOD 05`, `MOD 04`, `SCN 02`, and `MOD 11`
16
- - `SCN 02` now needs to formalize the normalized scenario pack below the stable outer contract
17
 
18
  ---
19
 
20
  ## Immediate next tasks
21
 
22
- - [ ] **MOD 12** | Create environment configuration module with shared constants | 0.5h | Depends: FND 08
23
- - [ ] **SCN 01** | Implement deterministic RNG helper `seed_rng()` | 0.5h | Depends: FND 08
24
- - [ ] **MOD 05** | Add protocol validation for sample size, controls, duration, and vocab checks | 1h | Depends: MOD 01
25
- - [ ] **MOD 04** | Implement `EpisodeState` and `EpisodeLog` models | 0.75h | Depends: MOD 03
26
- - [ ] **SCN 02** | Define normalized scenario schema with hidden reference spec and mapper-ready inputs | 0.75h | Depends: MOD 04
27
- - [ ] **MOD 11** | Implement `StepResult` model | 0.5h | Depends: MOD 03
28
 
29
  ---
30
 
31
- ## Foundation tasks already landed
32
 
33
  - [x] **FND 04** | Completed by Person B (Ayush)
34
  - [x] **FND 08** | Completed with shared sign-off
@@ -36,4 +35,18 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
36
  - [x] **MOD 01** | Completed by Person B (Ayush)
37
  - [x] **MOD 02** | Completed by Person B (Ayush)
38
  - [x] **MOD 03** | Completed by Person B (Ayush)
39
-
6
 
7
  ## Current status
8
 
9
+ - `FND 04`, `FND 08`, and `FND 09` are complete
10
+ - `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`, `MOD 11`, and `MOD 12` are complete
11
+ - Shared `AGT 05` is now complete through Ayush's implementation of the deterministic feasibility checker
12
+ - `SCN 01` to `SCN 10` are now complete in the repo
13
+ - The normalized scenario pack, seeded generation, difficulty scaling, and three initial domain families are already present
14
+ - The next Kian-lane tasks are now `MOD 06`, `SCN 13`, `JDG 01`, `JDG 02`, `JDG 03`, and `ENV 01`
15
+ - `MOD 05` and shared `AGT 05` now exist, so the judge and environment path can build on real scenario-grounded checks instead of placeholder rules
 
16
 
17
  ---
18
 
19
  ## Immediate next tasks
20
 
21
+ - [ ] **MOD 06** | Add semantic validators for impossible plans such as zero sample size with positive controls | 0.75h | Depends: MOD 05
22
+ - [ ] **SCN 13** | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time-slot conflicts and duration | 1h | Depends: SCN 07
23
+ - [ ] **JDG 01** | Implement rigor or objective-validity score for plan completeness, required checks, method quality, and justification | 1.25h | Depends: SCN 08
24
+ - [ ] **JDG 02** | Implement feasibility score for budget, resources, time, staffing, compute, and bookings | 1.25h | Depends: SCN 07, AGT 05
25
+ - [ ] **JDG 03** | Implement fidelity score against hidden reference spec, required steps, and allowed substitutions | 1h | Depends: SCN 08
26
+ - [ ] **ENV 01** | Create `ReplicaLabEnv` class skeleton | 0.5h | Depends: MOD 04, SCN 09
27
 
28
  ---
29
 
30
+ ## Foundation and scenario tasks already landed
31
 
32
  - [x] **FND 04** | Completed by Person B (Ayush)
33
  - [x] **FND 08** | Completed with shared sign-off
 
35
  - [x] **MOD 01** | Completed by Person B (Ayush)
36
  - [x] **MOD 02** | Completed by Person B (Ayush)
37
  - [x] **MOD 03** | Completed by Person B (Ayush)
38
+ - [x] **MOD 04** | Completed by Person B (Ayush)
39
+ - [x] **MOD 05** | Completed by Person B (Ayush)
40
+ - [x] **MOD 11** | Completed by Person B (Ayush)
41
+ - [x] **MOD 12** | Completed by Person B (Ayush)
42
+ - [x] **AGT 05** | Completed by Person B (Ayush)
43
+ - [x] **SCN 01** | Completed by Person B (Ayush)
44
+ - [x] **SCN 02** | Completed by Person B (Ayush)
45
+ - [x] **SCN 03** | Completed by Person B (Ayush)
46
+ - [x] **SCN 04** | Completed by Person B (Ayush)
47
+ - [x] **SCN 05** | Completed by Person B (Ayush)
48
+ - [x] **SCN 06** | Completed by Person B (Ayush)
49
+ - [x] **SCN 07** | Completed by Person B (Ayush)
50
+ - [x] **SCN 08** | Completed by Person B (Ayush)
51
+ - [x] **SCN 09** | Completed by Person B (Ayush)
52
+ - [x] **SCN 10** | Completed by Person B (Ayush)
docs/kush/task_breakdown.md CHANGED
@@ -8,12 +8,12 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
8
 
9
  - `FND 06` is already complete and was executed by `Person B (Ayush)`
10
  - The README stub now exists, so the next unblocked documentation task is `DOC 01`
11
- - `FND 13` remains blocked until Max (Person C) lands `FND 03`
 
12
 
13
  ---
14
 
15
  ## Execution order
16
 
17
- 1. `DOC 01` improve the README opening now that the temporary stub exists
18
- 2. `FND 13` once `FND 03` exists, add Tailwind and base style config
19
-
 
8
 
9
  - `FND 06` is already complete and was executed by `Person B (Ayush)`
10
  - The README stub now exists, so the next unblocked documentation task is `DOC 01`
11
+ - The validated frontend import from Kush's branch is now on `ayush`
12
+ - `FND 13` is now unblocked because `FND 03` is complete
13
 
14
  ---
15
 
16
  ## Execution order
17
 
18
+ 1. `DOC 01` - improve the README opening now that the temporary stub exists
19
+ 2. `FND 13` - reconcile the Tailwind plus base style pipeline with the imported frontend and the current source-of-truth file layout
 
docs/kush/task_list.md CHANGED
@@ -8,12 +8,13 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
8
 
9
  - `FND 06` is complete and was executed by `Person B (Ayush)`
10
  - `DOC 01` is now unblocked because `FND 06` is complete
11
- - `FND 13` is still blocked on `FND 03`
 
12
 
13
  ---
14
 
15
  ## Immediate next tasks
16
 
17
  - [ ] **DOC 01** | Write hook, problem statement, and one-line product summary | Depends: `FND 06` | Status: unblocked
18
- - [ ] **FND 13** | Configure Tailwind and base styling pipeline | Depends: `FND 03` | Status: blocked
19
 
 
8
 
9
  - `FND 06` is complete and was executed by `Person B (Ayush)`
10
  - `DOC 01` is now unblocked because `FND 06` is complete
11
+ - The frontend shell from Kush's branch is now on `ayush` and builds successfully
12
+ - `FND 13` is now unblocked because `FND 03` is complete
13
 
14
  ---
15
 
16
  ## Immediate next tasks
17
 
18
  - [ ] **DOC 01** | Write hook, problem statement, and one-line product summary | Depends: `FND 06` | Status: unblocked
19
+ - [ ] **FND 13** | Configure Tailwind and base styling pipeline | Depends: `FND 03` | Status: unblocked
20
 
docs/max/task_breakdown.md CHANGED
@@ -8,24 +8,24 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
8
 
9
  - `FND 01`, `FND 02`, `FND 05`, `FND 07`, and `FND 10` are already complete
10
  - Those tasks were executed by `Person B (Ayush)` and logged in `docs/changes.md`
 
11
  - `FND 11` is now complete and verified
12
  - A normalized backend import from Max's PR is on `ayush`: `server/app.py`, `server/Dockerfile`, and `docs/max/deployment.md`
13
  - That backend import is intentionally tracked as partial because it still runs on the stub env and Docker has not yet been validated locally
14
- - Max's next untouched implementation priority in the Person C lane is `FND 03`
15
 
16
  ---
17
 
18
  ## Unblocked now
19
 
20
- 1. `FND 03` React plus Vite frontend shell
21
- 2. Convert the stub-backed API tasks to real-env-backed implementations once Kian lands the environment work
22
 
23
  ---
24
 
25
  ## Still blocked
26
 
27
- - `FND 12` depends on `FND 03`
28
- - `FND 13` depends on `FND 03` even though it is owned by Kush (Person D)
29
  - Real completion of `API 01`, `API 02`, `API 03`, `API 06`, and `API 07` depends on Kian's environment tasks
30
  - Real completion of `API 08` depends on local Docker build and run validation
31
 
@@ -33,9 +33,6 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
33
 
34
  ## Recommended execution order
35
 
36
- 1. Finish `FND 03`
37
- 2. Land `FND 12`
38
- 3. Re-validate the imported server scaffold against Kian's environment implementation
39
- 4. Validate `server/Dockerfile` locally
40
- 5. Continue into deployment and replay work once the real env path is stable
41
-
 
8
 
9
  - `FND 01`, `FND 02`, `FND 05`, `FND 07`, and `FND 10` are already complete
10
  - Those tasks were executed by `Person B (Ayush)` and logged in `docs/changes.md`
11
+ - `FND 03` and `FND 12` are now complete via the validated frontend import from Kush's branch onto `ayush`
12
  - `FND 11` is now complete and verified
13
  - A normalized backend import from Max's PR is on `ayush`: `server/app.py`, `server/Dockerfile`, and `docs/max/deployment.md`
14
  - That backend import is intentionally tracked as partial because it still runs on the stub env and Docker has not yet been validated locally
15
+ - Max's remaining implementation priority is the real-env-backed API and deployment path
16
 
17
  ---
18
 
19
  ## Unblocked now
20
 
21
+ 1. Convert the stub-backed API tasks to real-env-backed implementations once Kian lands the environment work
22
+ 2. Validate Docker locally once the real env path is in place
23
 
24
  ---
25
 
26
  ## Still blocked
27
 
28
+ - `FND 13` is now unblocked because `FND 03` is complete, but it remains owned by Kush (Person D)
 
29
  - Real completion of `API 01`, `API 02`, `API 03`, `API 06`, and `API 07` depends on Kian's environment tasks
30
  - Real completion of `API 08` depends on local Docker build and run validation
31
 
 
33
 
34
  ## Recommended execution order
35
 
36
+ 1. Re-validate the imported server scaffold against Kian's environment implementation
37
+ 2. Validate `server/Dockerfile` locally
38
+ 3. Continue into deployment and replay work once the real env path is stable
 
 
 
docs/max/task_list.md CHANGED
@@ -8,17 +8,17 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
8
 
9
  - `FND 01`, `FND 02`, `FND 05`, `FND 07`, and `FND 10` are complete
10
  - All five were executed by `Person B (Ayush)` and recorded as executor deviations
 
11
  - `FND 11` is complete
 
12
  - A stub-backed backend server scaffold now exists in `server/app.py`
13
  - `API 01`, `API 02`, `API 03`, `API 04`, `API 06`, `API 07`, `API 08`, `API 13`, `API 14`, and `OBS 02` are partial pending real-env and Docker-level verification
14
- - The next unblocked untouched Max task is `FND 03`
15
 
16
  ---
17
 
18
  ## Immediate next tasks
19
 
20
- - [ ] **FND 03** | Initialize React plus Vite frontend shell | Depends: `FND 01` | Status: unblocked
21
- - [ ] **FND 12** | Create `frontend/vite.config.ts` with proxy settings | Depends: `FND 03` | Status: blocked
22
  - [ ] **API 01 / API 02 / API 03 / API 06** | Convert the stub-backed server scaffold into real-env-backed endpoints | Depends: `ENV 01`, `ENV 02`, `ENV 06` | Status: partial
23
  - [ ] **API 08** | Validate Docker locally for the server image | Depends: `API 01` to `API 07` | Status: partial
24
  - [ ] **OBS 02** | Confirm logging behavior against the integrated environment path | Depends: `API 01` | Status: partial
@@ -29,9 +29,11 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
29
 
30
  - [x] **FND 01** | Completed by Person B (Ayush)
31
  - [x] **FND 02** | Completed by Person B (Ayush)
 
32
  - [x] **FND 05** | Completed by Person B (Ayush)
33
  - [x] **FND 07** | Completed by Person B (Ayush)
34
  - [x] **FND 10** | Completed by Person B (Ayush)
 
35
 
36
  ## Completed in Max's lane
37
 
 
8
 
9
  - `FND 01`, `FND 02`, `FND 05`, `FND 07`, and `FND 10` are complete
10
  - All five were executed by `Person B (Ayush)` and recorded as executor deviations
11
+ - `FND 03` is complete via the validated frontend import from Kush's branch onto `ayush`
12
  - `FND 11` is complete
13
+ - `FND 12` is complete via the imported and validated `frontend/vite.config.ts`
14
  - A stub-backed backend server scaffold now exists in `server/app.py`
15
  - `API 01`, `API 02`, `API 03`, `API 04`, `API 06`, `API 07`, `API 08`, `API 13`, `API 14`, and `OBS 02` are partial pending real-env and Docker-level verification
16
+ - The remaining Max work is now the API, Docker, deployment, replay, and observability path
17
 
18
  ---
19
 
20
  ## Immediate next tasks
21
 
 
 
22
  - [ ] **API 01 / API 02 / API 03 / API 06** | Convert the stub-backed server scaffold into real-env-backed endpoints | Depends: `ENV 01`, `ENV 02`, `ENV 06` | Status: partial
23
  - [ ] **API 08** | Validate Docker locally for the server image | Depends: `API 01` to `API 07` | Status: partial
24
  - [ ] **OBS 02** | Confirm logging behavior against the integrated environment path | Depends: `API 01` | Status: partial
 
29
 
30
  - [x] **FND 01** | Completed by Person B (Ayush)
31
  - [x] **FND 02** | Completed by Person B (Ayush)
32
+ - [x] **FND 03** | Completed by Kush and imported onto `ayush`
33
  - [x] **FND 05** | Completed by Person B (Ayush)
34
  - [x] **FND 07** | Completed by Person B (Ayush)
35
  - [x] **FND 10** | Completed by Person B (Ayush)
36
+ - [x] **FND 12** | Completed by Kush and imported onto `ayush`
37
 
38
  ## Completed in Max's lane
39
 
replicalab/agents/__init__.py CHANGED
@@ -5,6 +5,7 @@ from .lab_manager_policy import (
5
  FeasibilityCheckResult,
6
  SuggestionChange,
7
  check_feasibility,
 
8
  suggest_alternative,
9
  )
10
  from .scientist_policy import (
@@ -29,6 +30,7 @@ __all__ = [
29
  "build_scientist_system_prompt",
30
  "call_scientist_with_retry",
31
  "check_feasibility",
 
32
  "format_scientist_observation",
33
  "parse_scientist_output",
34
  "suggest_alternative",
 
5
  FeasibilityCheckResult,
6
  SuggestionChange,
7
  check_feasibility,
8
+ compose_lab_manager_response,
9
  suggest_alternative,
10
  )
11
  from .scientist_policy import (
 
30
  "build_scientist_system_prompt",
31
  "call_scientist_with_retry",
32
  "check_feasibility",
33
+ "compose_lab_manager_response",
34
  "format_scientist_observation",
35
  "parse_scientist_output",
36
  "suggest_alternative",
replicalab/agents/lab_manager_policy.py CHANGED
@@ -5,15 +5,19 @@ normalized scenario pack and returns stable pass/fail status per dimension.
5
  AGT 06 adds ``suggest_alternative`` which mechanically applies substitution
6
  rules, clamps duration, and reduces sample size to produce a concrete
7
  revised protocol with a post-fix feasibility recheck.
8
  """
9
 
10
  from __future__ import annotations
11
 
12
- from typing import Optional
13
 
14
  from pydantic import BaseModel, ConfigDict, Field, computed_field
15
 
16
- from replicalab.models import Protocol
17
  from replicalab.scenarios import NormalizedScenarioPack
18
  from replicalab.utils.validation import ValidationResult, validate_protocol
19
 
@@ -182,6 +186,12 @@ class AlternativeSuggestion(BaseModel):
182
  post_check: FeasibilityCheckResult
183
 
184
 
185
  def suggest_alternative(
186
  protocol: Protocol,
187
  check_result: FeasibilityCheckResult,
@@ -300,6 +310,55 @@ def suggest_alternative(
300
  )
301
 
302
 
303
  def _apply_substitutions(
304
  items: list[str],
305
  substitution_options: dict[str, list[str]],
@@ -385,6 +444,125 @@ def _build_tradeoff_index(scenario: NormalizedScenarioPack) -> dict[str, str]:
385
  return index
386
 
387
 
388
  def _build_protocol_check(validation_result: ValidationResult) -> DimensionCheck:
389
  reasons = [issue.message for issue in validation_result.issues]
390
  return DimensionCheck(ok=validation_result.valid, reasons=reasons)
 
5
  AGT 06 adds ``suggest_alternative`` which mechanically applies substitution
6
  rules, clamps duration, and reduces sample size to produce a concrete
7
  revised protocol with a post-fix feasibility recheck.
8
+ AGT 07 adds ``compose_lab_manager_response`` which converts those grounded
9
+ results into a typed ``LabManagerAction`` with stable flags plus a readable
10
+ explanation. An optional explanation renderer can add richer language later
11
+ without taking over the verdict or constraint fields.
12
  """
13
 
14
  from __future__ import annotations
15
 
16
+ from typing import Callable, Optional
17
 
18
  from pydantic import BaseModel, ConfigDict, Field, computed_field
19
 
20
+ from replicalab.models import LabManagerAction, LabManagerActionType, Protocol
21
  from replicalab.scenarios import NormalizedScenarioPack
22
  from replicalab.utils.validation import ValidationResult, validate_protocol
23
 
 
186
  post_check: FeasibilityCheckResult
187
 
188
 
189
+ ExplanationRenderer = Callable[
190
+ [LabManagerActionType, FeasibilityCheckResult, Optional[AlternativeSuggestion]],
191
+ str,
192
+ ]
193
+
194
+
195
  def suggest_alternative(
196
  protocol: Protocol,
197
  check_result: FeasibilityCheckResult,
 
310
  )
311
 
312
 
+ def compose_lab_manager_response(
+     check_result: FeasibilityCheckResult,
+     suggestion: Optional[AlternativeSuggestion] = None,
+     *,
+     explanation_renderer: Optional[ExplanationRenderer] = None,
+ ) -> LabManagerAction:
+     """Compose a grounded ``LabManagerAction`` from deterministic inputs.
+
+     The verdict and all feasibility flags remain deterministic. Callers may
+     optionally inject an ``explanation_renderer`` for richer wording, but it
+     never controls the action type or the pass/fail flags.
+     """
+
+     action_type = _select_lab_manager_action_type(check_result, suggestion)
+     explanation = (
+         explanation_renderer(action_type, check_result, suggestion)
+         if explanation_renderer is not None
+         else _build_default_explanation(action_type, check_result, suggestion)
+     ).strip()
+     if not explanation:
+         raise ValueError("Lab Manager explanation must be non-empty")
+
+     suggested_protocol = (
+         suggestion.revised_protocol
+         if action_type is LabManagerActionType.SUGGEST_ALTERNATIVE and suggestion is not None
+         else None
+     )
+
+     return LabManagerAction(
+         action_type=action_type,
+         feasible=_lab_constraints_feasible(check_result),
+         budget_ok=check_result.budget_ok,
+         equipment_ok=check_result.equipment_ok,
+         reagents_ok=check_result.reagents_ok,
+         schedule_ok=check_result.schedule_ok,
+         staff_ok=check_result.staff_ok,
+         suggested_technique=(
+             suggested_protocol.technique if suggested_protocol is not None else ""
+         ),
+         suggested_sample_size=(
+             suggested_protocol.sample_size if suggested_protocol is not None else 0
+         ),
+         suggested_controls=(
+             list(suggested_protocol.controls) if suggested_protocol is not None else []
+         ),
+         explanation=explanation,
+     )
+
+
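
The renderer-injection pattern in `compose_lab_manager_response` can be illustrated in isolation. The following is a standalone sketch, not the module's code: `SketchResponse`, `SketchRenderer`, `compose_sketch`, and `friendly` are hypothetical stand-ins, and the real function takes a `FeasibilityCheckResult` and an optional `AlternativeSuggestion` rather than the two plain values used here. It shows the same two guarantees: the renderer only supplies wording, and a blank explanation is rejected.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SketchResponse:
    """Hypothetical stand-in for the typed action returned by the composer."""

    verdict: str
    budget_ok: bool
    explanation: str


# Hypothetical renderer signature: receives the decided verdict, returns text.
SketchRenderer = Callable[[str, bool], str]


def compose_sketch(
    verdict: str,
    budget_ok: bool,
    renderer: Optional[SketchRenderer] = None,
) -> SketchResponse:
    """Build a response whose verdict and flag never depend on the renderer."""
    explanation = (
        renderer(verdict, budget_ok)
        if renderer is not None
        else f"{verdict}: budget_ok={budget_ok}"
    ).strip()
    if not explanation:
        raise ValueError("explanation must be non-empty")
    return SketchResponse(verdict=verdict, budget_ok=budget_ok, explanation=explanation)


def friendly(verdict: str, budget_ok: bool) -> str:
    # Only the wording changes; surrounding whitespace is stripped by the composer.
    return "  The plan fits the budget.  " if budget_ok else "The plan is over budget."


response = compose_sketch("accept", True, renderer=friendly)
print(response.verdict, "|", response.explanation)
# accept | The plan fits the budget.
```

Keeping the renderer's return type to a plain string is what lets a later LLM-backed renderer be swapped in without giving it any say over the verdict.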
362
  def _apply_substitutions(
363
  items: list[str],
364
  substitution_options: dict[str, list[str]],
 
444
  return index
445
 
446
 
+ def _select_lab_manager_action_type(
+     check_result: FeasibilityCheckResult,
+     suggestion: Optional[AlternativeSuggestion],
+ ) -> LabManagerActionType:
+     """Choose the outward action mode from grounded feasibility results."""
+
+     lab_constraints_ok = _lab_constraints_feasible(check_result)
+
+     if lab_constraints_ok and check_result.protocol.ok and check_result.policy.ok:
+         return LabManagerActionType.ACCEPT
+
+     if suggestion is not None and suggestion.applied_changes:
+         return LabManagerActionType.SUGGEST_ALTERNATIVE
+
+     if lab_constraints_ok:
+         return LabManagerActionType.REPORT_FEASIBILITY
+
+     return LabManagerActionType.REJECT
+
+
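
The selector above is a four-rung priority ladder. A standalone sketch of the same ordering, using hypothetical stand-ins (`SketchAction`, `SketchCheck`, `select_action`) in place of the real `LabManagerActionType` and `FeasibilityCheckResult`, and collapsing the five lab flags into a single `lab_ok` boolean as an assumption for brevity:

```python
from dataclasses import dataclass
from enum import Enum


class SketchAction(Enum):
    """Hypothetical stand-in for LabManagerActionType."""

    ACCEPT = "accept"
    SUGGEST_ALTERNATIVE = "suggest_alternative"
    REPORT_FEASIBILITY = "report_feasibility"
    REJECT = "reject"


@dataclass(frozen=True)
class SketchCheck:
    lab_ok: bool       # budget, equipment, reagents, schedule, and staff all pass
    protocol_ok: bool  # protocol validation passed
    policy_ok: bool    # policy check passed


def select_action(check: SketchCheck, has_applied_changes: bool) -> SketchAction:
    """Mirror the ordering: accept, then suggest, then report, then reject."""
    if check.lab_ok and check.protocol_ok and check.policy_ok:
        return SketchAction.ACCEPT
    if has_applied_changes:
        return SketchAction.SUGGEST_ALTERNATIVE
    if check.lab_ok:
        return SketchAction.REPORT_FEASIBILITY
    return SketchAction.REJECT


# Lab resources are fine but the protocol fails validation and no revision exists.
print(select_action(SketchCheck(lab_ok=True, protocol_ok=False, policy_ok=True), False))
# SketchAction.REPORT_FEASIBILITY
```

One consequence of the ordering is that a suggestion with applied changes wins over a plain feasibility report even when the lab constraints themselves pass, so protocol or policy failures that come with a usable revision surface as `SUGGEST_ALTERNATIVE`.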
+ def _build_default_explanation(
+     action_type: LabManagerActionType,
+     check_result: FeasibilityCheckResult,
+     suggestion: Optional[AlternativeSuggestion],
+ ) -> str:
+     """Render a deterministic human-readable explanation."""
+
+     if action_type is LabManagerActionType.ACCEPT:
+         return f"Accepted. {check_result.summary}"
+
+     if action_type is LabManagerActionType.SUGGEST_ALTERNATIVE and suggestion is not None:
+         parts = [
+             "Current proposal is not feasible under the present lab constraints.",
+             _format_reason_block(check_result, include_protocol=False, include_policy=False),
+             "Suggested revision: "
+             + " ".join(_format_change_sentence(change) for change in suggestion.applied_changes),
+         ]
+         if suggestion.remaining_failures:
+             parts.append(
+                 "Remaining issues after the suggested revision: "
+                 + ", ".join(suggestion.remaining_failures)
+                 + "."
+             )
+         return " ".join(part for part in parts if part)
+
+     if action_type is LabManagerActionType.REPORT_FEASIBILITY:
+         parts = [
+             "Feasibility report: lab resources and schedule are workable, but the current proposal still needs revision.",
+             _format_reason_block(check_result, include_protocol=True, include_policy=True),
+         ]
+         return " ".join(part for part in parts if part)
+
+     parts = [
+         "Rejected. No deterministic revision could satisfy the current lab constraints.",
+         _format_reason_block(check_result, include_protocol=False, include_policy=False),
+     ]
+     return " ".join(part for part in parts if part)
+
+
506
+ def _lab_constraints_feasible(check_result: FeasibilityCheckResult) -> bool:
507
+ return all(
508
+ (
509
+ check_result.budget_ok,
510
+ check_result.equipment_ok,
511
+ check_result.reagents_ok,
512
+ check_result.schedule_ok,
513
+ check_result.staff_ok,
514
+ )
515
+ )
516
+
517
+
518
+ def _format_reason_block(
519
+ check_result: FeasibilityCheckResult,
520
+ *,
521
+ include_protocol: bool,
522
+ include_policy: bool,
523
+ ) -> str:
524
+ blocks: list[str] = []
525
+ for name, check in _iter_dimension_checks(
526
+ check_result,
527
+ include_protocol=include_protocol,
528
+ include_policy=include_policy,
529
+ ):
530
+ if check.ok or not check.reasons:
531
+ continue
532
+ blocks.append(f"{name}: {' '.join(check.reasons)}")
533
+ return " ".join(blocks)
534
+
535
+
536
+ def _format_change_sentence(change: SuggestionChange) -> str:
537
+ return (
538
+ f"{change.field} changed from {change.original} to {change.revised}. "
539
+ f"{change.reason} Tradeoff: {change.tradeoff}"
540
+ )
541
+
542
+
543
+ def _iter_dimension_checks(
544
+ check_result: FeasibilityCheckResult,
545
+ *,
546
+ include_protocol: bool,
547
+ include_policy: bool,
548
+ ) -> list[tuple[str, DimensionCheck]]:
549
+ checks: list[tuple[str, DimensionCheck]] = []
550
+ if include_protocol:
551
+ checks.append(("protocol", check_result.protocol))
552
+ checks.extend(
553
+ [
554
+ ("budget", check_result.budget),
555
+ ("equipment", check_result.equipment),
556
+ ("reagents", check_result.reagents),
557
+ ("schedule", check_result.schedule),
558
+ ("staff", check_result.staff),
559
+ ]
560
+ )
561
+ if include_policy:
562
+ checks.append(("policy", check_result.policy))
563
+ return checks
564
+
565
+
566
  def _build_protocol_check(validation_result: ValidationResult) -> DimensionCheck:
567
  reasons = [issue.message for issue in validation_result.issues]
568
  return DimensionCheck(ok=validation_result.valid, reasons=reasons)
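
The decision order in `_select_lab_manager_action_type` above can be sketched in isolation. This is a minimal illustration, not the repository's code: `ActionType`, `Check`, and `select_action_type` are hypothetical stand-ins for `LabManagerActionType`, `FeasibilityCheckResult`, and the private helper, and `has_applied_changes` stands for `suggestion is not None and suggestion.applied_changes`.

```python
from dataclasses import dataclass
from enum import Enum


class ActionType(Enum):  # hypothetical stand-in for LabManagerActionType
    ACCEPT = "accept"
    SUGGEST_ALTERNATIVE = "suggest_alternative"
    REPORT_FEASIBILITY = "report_feasibility"
    REJECT = "reject"


@dataclass
class Check:  # hypothetical stand-in for FeasibilityCheckResult
    lab_ok: bool  # budget, equipment, reagents, schedule and staff all pass
    protocol_ok: bool
    policy_ok: bool


def select_action_type(check: Check, has_applied_changes: bool) -> ActionType:
    # 1. Everything passes: accept.
    if check.lab_ok and check.protocol_ok and check.policy_ok:
        return ActionType.ACCEPT
    # 2. A revision with at least one applied change exists: suggest it.
    if has_applied_changes:
        return ActionType.SUGGEST_ALTERNATIVE
    # 3. Lab constraints pass but protocol or policy checks fail: report only.
    if check.lab_ok:
        return ActionType.REPORT_FEASIBILITY
    # 4. Lab constraints fail and no revision is available: reject.
    return ActionType.REJECT
```

Because the suggestion branch is tested before the lab-constraint branch, a protocol that passes every lab constraint but fails policy still yields `SUGGEST_ALTERNATIVE` whenever a revision with applied changes is available.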
replicalab/agents/scientist_policy.py CHANGED
@@ -7,6 +7,8 @@ instead of hard-coded domain text. AGT 02 adds the per-turn observation
7
  formatter that converts a ``ScientistObservation`` into the user message
8
  sent to the LLM each round. AGT 03 wraps the formatter and parser in a
9
  retry loop with error-specific correction prompts and exposed telemetry.
 
 
10
  """
11
 
12
  from __future__ import annotations
@@ -30,6 +32,44 @@ from replicalab.scenarios import NormalizedScenarioPack
30
  log = logging.getLogger(__name__)
31
 
32
  _JSON_FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.IGNORECASE | re.DOTALL)
33
 
34
 
35
  class ScientistOutputParseError(ValueError):
@@ -301,6 +341,30 @@ def format_scientist_observation(obs: ScientistObservation) -> str:
301
  return "\n\n".join(sections)
302
 
303
 
304
  def _render_history(entries: list[ConversationEntry]) -> str:
305
  lines: list[str] = []
306
  for entry in entries:
@@ -510,3 +574,119 @@ def _render_substitutions(pack: NormalizedScenarioPack) -> str:
510
  )
511
  )
512
  return "\n".join(lines)
 
7
  formatter that converts a ``ScientistObservation`` into the user message
8
  sent to the LLM each round. AGT 03 wraps the formatter and parser in a
9
  retry loop with error-specific correction prompts and exposed telemetry.
10
+ AGT 04 adds a deterministic baseline Scientist so smoke tests can run
11
+ without a trained model.
12
  """
13
 
14
  from __future__ import annotations
 
32
  log = logging.getLogger(__name__)
33
 
34
  _JSON_FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.IGNORECASE | re.DOTALL)
35
+ _ML_HINTS = (
36
+ "benchmark",
37
+ "dataset",
38
+ "accuracy",
39
+ "tokenizer",
40
+ "train",
41
+ "gpu",
42
+ "cifar",
43
+ "ag news",
44
+ "bert",
45
+ "resnet",
46
+ )
47
+ _FINANCE_HINTS = (
48
+ "backtest",
49
+ "drawdown",
50
+ "sharpe",
51
+ "trading",
52
+ "slippage",
53
+ "capital",
54
+ "spy",
55
+ "qqq",
56
+ "futures",
57
+ )
58
+ _BLOCKER_HINTS = (
59
+ "booked",
60
+ "unavailable",
61
+ "not available",
62
+ "exceeds",
63
+ "tight",
64
+ "limited",
65
+ "deadline",
66
+ "budget",
67
+ "cost",
68
+ "drawdown",
69
+ "slippage",
70
+ "risk",
71
+ "conflict",
72
+ )
73
 
74
 
75
  class ScientistOutputParseError(ValueError):
 
341
  return "\n\n".join(sections)
342
 
343
 
344
+ def build_baseline_scientist_action(
345
+ observation: ScientistObservation,
346
+ ) -> ScientistAction:
347
+ """Return a deterministic non-LLM Scientist action for smoke tests.
348
+
349
+ The baseline follows a conservative policy:
350
+ - propose a valid protocol when no protocol exists yet
351
+ - revise the current protocol if the latest Lab Manager message contains
352
+ an obvious feasibility blocker
353
+ - otherwise accept the current protocol to complete the episode cleanly
354
+ """
355
+
356
+ latest_feedback = _latest_lab_manager_feedback(observation)
357
+
358
+ if observation.current_protocol is not None:
359
+ if observation.round_number >= max(1, observation.max_rounds - 1):
360
+ return _build_accept_action()
361
+ if latest_feedback and _feedback_requires_revision(latest_feedback.message):
362
+ return _build_revision_action(observation.current_protocol, latest_feedback)
363
+ return _build_accept_action()
364
+
365
+ return _build_initial_protocol_action(observation)
366
+
367
+
368
  def _render_history(entries: list[ConversationEntry]) -> str:
369
  lines: list[str] = []
370
  for entry in entries:
 
574
  )
575
  )
576
  return "\n".join(lines)
577
+
578
+
579
+ def _build_accept_action() -> ScientistAction:
580
+ return ScientistAction(
581
+ action_type=ScientistActionType.ACCEPT,
582
+ sample_size=0,
583
+ controls=[],
584
+ technique="",
585
+ duration_days=0,
586
+ required_equipment=[],
587
+ required_reagents=[],
588
+ questions=[],
589
+ rationale="",
590
+ )
591
+
592
+
593
+ def _build_initial_protocol_action(
594
+ observation: ScientistObservation,
595
+ ) -> ScientistAction:
596
+ domain = _infer_domain(observation)
597
+ defaults = _baseline_defaults_for_domain(domain)
598
+
599
+ return ScientistAction(
600
+ action_type=ScientistActionType.PROPOSE_PROTOCOL,
601
+ sample_size=defaults["sample_size"],
602
+ controls=list(defaults["controls"]),
603
+ technique=defaults["technique"],
604
+ duration_days=defaults["duration_days"],
605
+ required_equipment=[],
606
+ required_reagents=[],
607
+ questions=[],
608
+ rationale=(
609
+ f"Baseline proposal for {observation.paper_title}: "
610
+ f"use a concise {defaults['technique']} plan aligned to the stated goal "
611
+ f"'{observation.experiment_goal}'."
612
+ ),
613
+ )
614
+
615
+
616
+ def _build_revision_action(
617
+ protocol: Protocol,
618
+ feedback: ConversationEntry,
619
+ ) -> ScientistAction:
620
+ reduced_sample_size = max(1, protocol.sample_size // 2) if protocol.sample_size else 1
621
+ reduced_duration = max(1, protocol.duration_days - 1) if protocol.duration_days else 1
622
+ revised_controls = list(protocol.controls) or ["fallback_review"]
623
+
624
+ return ScientistAction(
625
+ action_type=ScientistActionType.REVISE_PROTOCOL,
626
+ sample_size=reduced_sample_size,
627
+ controls=revised_controls,
628
+ technique=protocol.technique,
629
+ duration_days=reduced_duration,
630
+ required_equipment=list(protocol.required_equipment),
631
+ required_reagents=list(protocol.required_reagents),
632
+ questions=[],
633
+ rationale=(
634
+ "Baseline revision reduces scope to address the latest Lab Manager "
635
+ f"concern: {feedback.message}"
636
+ ),
637
+ )
638
+
639
+
640
+ def _latest_lab_manager_feedback(
641
+ observation: ScientistObservation,
642
+ ) -> ConversationEntry | None:
643
+ for entry in reversed(observation.conversation_history):
644
+ if entry.role == "lab_manager":
645
+ return entry
646
+ return None
647
+
648
+
649
+ def _feedback_requires_revision(message: str) -> bool:
650
+ lowered = message.lower()
651
+ return any(token in lowered for token in _BLOCKER_HINTS)
652
+
653
+
654
+ def _infer_domain(observation: ScientistObservation) -> str:
655
+ haystack = " ".join(
656
+ [
657
+ observation.paper_title,
658
+ observation.paper_hypothesis,
659
+ observation.paper_method,
660
+ observation.paper_key_finding,
661
+ observation.experiment_goal,
662
+ ]
663
+ ).lower()
664
+
665
+ if any(token in haystack for token in _ML_HINTS):
666
+ return "machine_learning"
667
+ if any(token in haystack for token in _FINANCE_HINTS):
668
+ return "finance_trading"
669
+ return "mathematics"
670
+
671
+
672
+ def _baseline_defaults_for_domain(domain: str) -> dict[str, Any]:
673
+ if domain == "machine_learning":
674
+ return {
675
+ "sample_size": 8,
676
+ "controls": ["published_split_check", "heldout_evaluation"],
677
+ "technique": "published_split_replication",
678
+ "duration_days": 2,
679
+ }
680
+ if domain == "finance_trading":
681
+ return {
682
+ "sample_size": 12,
683
+ "controls": ["drawdown_guardrail", "offline_evaluation_split"],
684
+ "technique": "offline_backtest_workflow",
685
+ "duration_days": 2,
686
+ }
687
+ return {
688
+ "sample_size": 4,
689
+ "controls": ["equality_case_check", "final_verification_pass"],
690
+ "technique": "structured_proof_outline",
691
+ "duration_days": 1,
692
+ }
server/app.py CHANGED
@@ -31,6 +31,11 @@ from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect
31
  from fastapi.middleware.cors import CORSMiddleware
32
  from pydantic import BaseModel
33
 
 
 
 
 
 
34
  from replicalab.config import (
35
  API_HOST,
36
  API_PORT,
@@ -40,11 +45,16 @@ from replicalab.config import (
40
  STUB_ACCEPT_REWARD,
41
  WS_IDLE_TIMEOUT_SECONDS,
42
  )
43
- from replicalab.scenarios import available_scenario_families, generate_scenario
 
 
 
 
44
  from replicalab.models import (
45
  ConversationEntry,
46
  EpisodeLog,
47
  EpisodeState,
 
48
  LabManagerObservation,
49
  Observation,
50
  Protocol,
@@ -121,6 +131,7 @@ class _StubEnv:
121
  self._state = EpisodeState()
122
  self._logs: list[ConversationEntry] = []
123
  self._episode_id: str = ""
 
124
 
125
  # ── public interface (matches ReplicaLabEnv) ──────────────────────────
126
 
@@ -133,6 +144,7 @@ class _StubEnv:
133
  self._episode_id = str(uuid.uuid4())
134
  self._logs = []
135
  pack = generate_scenario(seed=seed, template=scenario, difficulty=difficulty)
 
136
  self._state = EpisodeState(
137
  seed=seed,
138
  scenario_template=scenario,
@@ -160,10 +172,12 @@ class _StubEnv:
160
 
161
  def step(self, action: ScientistAction) -> StepResult:
162
  self._state.round_number += 1
 
163
  self._logs.append(self._scientist_log_entry(action))
164
- self._logs.append(self._lab_manager_log_entry(action))
 
165
  self._state.conversation_history = list(self._logs)
166
- self._state.current_protocol = self._protocol_from_action(action)
167
  done = (
168
  action.action_type == "accept"
169
  or self._state.round_number >= self._state.max_rounds
@@ -218,20 +232,39 @@ class _StubEnv:
218
  action_type=action_type,
219
  )
220
 
221
- def _lab_manager_log_entry(self, action: ScientistAction) -> ConversationEntry:
222
- if action.action_type == "accept":
223
- message = "Stub review: agreement recorded and episode will close."
224
- action_type = "accept"
225
- else:
226
- message = "Stub review: proposal received and remains feasible under the stub lab."
227
- action_type = "report_feasibility"
228
  return ConversationEntry(
229
  role="lab_manager",
230
- message=message,
231
  round_number=self._state.round_number,
232
  action_type=action_type,
233
  )
234
 
235
  def _protocol_from_action(self, action: ScientistAction) -> Optional[Protocol]:
236
  if action.action_type not in {"propose_protocol", "revise_protocol"}:
237
  return self._state.current_protocol
 
31
  from fastapi.middleware.cors import CORSMiddleware
32
  from pydantic import BaseModel
33
 
34
+ from replicalab.agents import (
35
+ check_feasibility,
36
+ compose_lab_manager_response,
37
+ suggest_alternative,
38
+ )
39
  from replicalab.config import (
40
  API_HOST,
41
  API_PORT,
 
45
  STUB_ACCEPT_REWARD,
46
  WS_IDLE_TIMEOUT_SECONDS,
47
  )
48
+ from replicalab.scenarios import (
49
+ NormalizedScenarioPack,
50
+ available_scenario_families,
51
+ generate_scenario,
52
+ )
53
  from replicalab.models import (
54
  ConversationEntry,
55
  EpisodeLog,
56
  EpisodeState,
57
+ LabManagerAction,
58
  LabManagerObservation,
59
  Observation,
60
  Protocol,
 
131
  self._state = EpisodeState()
132
  self._logs: list[ConversationEntry] = []
133
  self._episode_id: str = ""
134
+ self._scenario_pack: Optional[NormalizedScenarioPack] = None
135
 
136
  # ── public interface (matches ReplicaLabEnv) ──────────────────────────
137
 
 
144
  self._episode_id = str(uuid.uuid4())
145
  self._logs = []
146
  pack = generate_scenario(seed=seed, template=scenario, difficulty=difficulty)
147
+ self._scenario_pack = pack
148
  self._state = EpisodeState(
149
  seed=seed,
150
  scenario_template=scenario,
 
172
 
173
  def step(self, action: ScientistAction) -> StepResult:
174
  self._state.round_number += 1
175
+ proposed_protocol = self._protocol_from_action(action)
176
  self._logs.append(self._scientist_log_entry(action))
177
+ lab_manager_action = self._lab_manager_action(proposed_protocol)
178
+ self._logs.append(self._lab_manager_log_entry(lab_manager_action))
179
  self._state.conversation_history = list(self._logs)
180
+ self._state.current_protocol = proposed_protocol
181
  done = (
182
  action.action_type == "accept"
183
  or self._state.round_number >= self._state.max_rounds
 
232
  action_type=action_type,
233
  )
234
 
235
+ def _lab_manager_log_entry(self, action: LabManagerAction) -> ConversationEntry:
236
+ action_type = (
237
+ action.action_type.value
238
+ if hasattr(action.action_type, "value")
239
+ else str(action.action_type)
240
+ )
 
241
  return ConversationEntry(
242
  role="lab_manager",
243
+ message=action.explanation,
244
  round_number=self._state.round_number,
245
  action_type=action_type,
246
  )
247
 
248
+ def _lab_manager_action(self, protocol: Optional[Protocol]) -> LabManagerAction:
249
+ if protocol is None or self._scenario_pack is None:
250
+ return LabManagerAction(
251
+ action_type="report_feasibility",
252
+ feasible=True,
253
+ budget_ok=True,
254
+ equipment_ok=True,
255
+ reagents_ok=True,
256
+ schedule_ok=True,
257
+ staff_ok=True,
258
+ suggested_technique="",
259
+ suggested_sample_size=0,
260
+ suggested_controls=[],
261
+ explanation="No concrete protocol is available to review yet.",
262
+ )
263
+
264
+ check_result = check_feasibility(protocol, self._scenario_pack)
265
+ suggestion = suggest_alternative(protocol, check_result, self._scenario_pack)
266
+ return compose_lab_manager_response(check_result, suggestion)
267
+
268
  def _protocol_from_action(self, action: ScientistAction) -> Optional[Protocol]:
269
  if action.action_type not in {"propose_protocol", "revise_protocol"}:
270
  return self._state.current_protocol
tests/test_lab_manager_policy.py CHANGED
@@ -3,9 +3,10 @@ from __future__ import annotations
3
  from replicalab.agents.lab_manager_policy import (
4
  AlternativeSuggestion,
5
  check_feasibility,
 
6
  suggest_alternative,
7
  )
8
- from replicalab.models import Protocol
9
  from replicalab.scenarios import generate_scenario
10
 
11
 
@@ -316,3 +317,94 @@ def test_suggest_alternative_reports_remaining_failures() -> None:
316
  result = suggest_alternative(protocol, check, scenario)
317
  assert result is not None
318
  assert "policy" in result.remaining_failures
 
3
  from replicalab.agents.lab_manager_policy import (
4
  AlternativeSuggestion,
5
  check_feasibility,
6
+ compose_lab_manager_response,
7
  suggest_alternative,
8
  )
9
+ from replicalab.models import LabManagerActionType, Protocol
10
  from replicalab.scenarios import generate_scenario
11
 
12
 
 
317
  result = suggest_alternative(protocol, check, scenario)
318
  assert result is not None
319
  assert "policy" in result.remaining_failures
320
+
321
+
322
+ # ---------------------------------------------------------------------------
323
+ # AGT 07 - compose_lab_manager_response
324
+ # ---------------------------------------------------------------------------
325
+
326
+
327
+ def test_compose_lab_manager_response_accepts_feasible_protocol() -> None:
328
+ scenario = _scenario("ml_benchmark", "easy")
329
+ protocol = _protocol_for_scenario(scenario)
330
+ check = check_feasibility(protocol, scenario)
331
+
332
+ action = compose_lab_manager_response(check)
333
+
334
+ assert action.action_type is LabManagerActionType.ACCEPT
335
+ assert action.feasible is True
336
+ assert action.suggested_technique == ""
337
+ assert "Accepted." in action.explanation
338
+
339
+
340
+ def test_compose_lab_manager_response_suggests_alternative_when_revision_exists() -> None:
341
+ scenario = _scenario("ml_benchmark", "easy")
342
+ protocol = _protocol_for_scenario(
343
+ scenario,
344
+ sample_size=200,
345
+ duration_days=scenario.lab_manager_observation.time_limit_days,
346
+ controls=["baseline", "ablation", "sanity_check"],
347
+ required_equipment=list(scenario.lab_manager_observation.equipment_available),
348
+ required_reagents=list(scenario.lab_manager_observation.reagents_in_stock),
349
+ )
350
+ check = check_feasibility(protocol, scenario)
351
+ suggestion = suggest_alternative(protocol, check, scenario)
352
+
353
+ assert suggestion is not None
354
+ action = compose_lab_manager_response(check, suggestion)
355
+
356
+ assert action.action_type is LabManagerActionType.SUGGEST_ALTERNATIVE
357
+ assert action.feasible is False
358
+ assert action.suggested_sample_size == suggestion.revised_protocol.sample_size
359
+ assert action.suggested_controls == suggestion.revised_protocol.controls
360
+ assert "Suggested revision:" in action.explanation
361
+
362
+
363
+ def test_compose_lab_manager_response_rejects_when_no_revision_exists() -> None:
364
+ scenario = _scenario("ml_benchmark", "easy")
365
+ protocol = _protocol_for_scenario(
366
+ scenario,
367
+ required_equipment=["Imaginary GPU Rack"],
368
+ )
369
+ check = check_feasibility(protocol, scenario)
370
+ suggestion = suggest_alternative(protocol, check, scenario)
371
+
372
+ action = compose_lab_manager_response(check, suggestion)
373
+
374
+ assert action.action_type is LabManagerActionType.REJECT
375
+ assert action.feasible is False
376
+ assert "No deterministic revision could satisfy" in action.explanation
377
+
378
+
379
+ def test_compose_lab_manager_response_reports_non_lab_issues() -> None:
380
+ scenario = _scenario("finance_trading", "easy")
381
+ protocol = _protocol_for_scenario(
382
+ scenario,
383
+ technique="live trading execution plan",
384
+ rationale="Use live trading once the backtest looks strong.",
385
+ )
386
+ check = check_feasibility(protocol, scenario)
387
+ suggestion = suggest_alternative(protocol, check, scenario)
388
+
389
+ action = compose_lab_manager_response(check, suggestion)
390
+
391
+ assert action.action_type is LabManagerActionType.REPORT_FEASIBILITY
392
+ assert action.feasible is True
393
+ assert "policy" in action.explanation.lower()
394
+
395
+
396
+ def test_compose_lab_manager_response_uses_custom_renderer_without_changing_verdict() -> None:
397
+ scenario = _scenario("ml_benchmark", "easy")
398
+ protocol = _protocol_for_scenario(scenario)
399
+ check = check_feasibility(protocol, scenario)
400
+
401
+ action = compose_lab_manager_response(
402
+ check,
403
+ explanation_renderer=lambda action_type, result, suggestion: (
404
+ f"Renderer saw {action_type.value} with feasible={result.feasible}."
405
+ ),
406
+ )
407
+
408
+ assert action.action_type is LabManagerActionType.ACCEPT
409
+ assert action.feasible is True
410
+ assert action.explanation == "Renderer saw accept with feasible=True."
tests/test_scientist_policy.py CHANGED
@@ -6,6 +6,7 @@ from replicalab.agents.scientist_policy import (
6
  RetryMetadata,
7
  ScientistCallResult,
8
  ScientistOutputParseError,
 
9
  build_scientist_system_prompt,
10
  call_scientist_with_retry,
11
  format_scientist_observation,
@@ -456,3 +457,102 @@ def test_retry_metadata_serializable() -> None:
456
  restored = RetryMetadata.model_validate_json(dumped)
457
  assert restored.attempt_count == 1
458
  assert restored.retry_count == 0
 
6
  RetryMetadata,
7
  ScientistCallResult,
8
  ScientistOutputParseError,
9
+ build_baseline_scientist_action,
10
  build_scientist_system_prompt,
11
  call_scientist_with_retry,
12
  format_scientist_observation,
 
457
  restored = RetryMetadata.model_validate_json(dumped)
458
  assert restored.attempt_count == 1
459
  assert restored.retry_count == 0
460
+
461
+
462
+ # ---------------------------------------------------------------------------
463
+ # AGT 04 - build_baseline_scientist_action
464
+ # ---------------------------------------------------------------------------
465
+
466
+
467
+ def test_baseline_scientist_proposes_protocol_for_fresh_observation() -> None:
468
+ action = build_baseline_scientist_action(_base_observation())
469
+
470
+ assert action.action_type is ScientistActionType.PROPOSE_PROTOCOL
471
+ assert action.sample_size >= 1
472
+ assert action.duration_days >= 1
473
+ assert action.questions == []
474
+ assert action.rationale
475
+
476
+
477
+ def test_baseline_scientist_accepts_existing_protocol_without_blocker() -> None:
478
+ obs = _base_observation(
479
+ current_protocol=Protocol(
480
+ sample_size=10,
481
+ controls=["baseline_check"],
482
+ technique="published_split_replication",
483
+ duration_days=2,
484
+ required_equipment=[],
485
+ required_reagents=[],
486
+ rationale="Initial protocol is already in place.",
487
+ ),
488
+ conversation_history=[
489
+ ConversationEntry(
490
+ role="lab_manager",
491
+ message="The current plan remains feasible.",
492
+ round_number=1,
493
+ action_type="report_feasibility",
494
+ )
495
+ ],
496
+ round_number=1,
497
+ )
498
+
499
+ action = build_baseline_scientist_action(obs)
500
+
501
+ assert action.action_type is ScientistActionType.ACCEPT
502
+ assert action.sample_size == 0
503
+ assert action.controls == []
504
+
505
+
506
+ def test_baseline_scientist_revises_when_latest_feedback_has_blocker() -> None:
507
+ obs = _base_observation(
508
+ current_protocol=Protocol(
509
+ sample_size=12,
510
+ controls=["published_split_check", "heldout_evaluation"],
511
+ technique="published_split_replication",
512
+ duration_days=3,
513
+ required_equipment=[],
514
+ required_reagents=[],
515
+ rationale="Original scope is full-size.",
516
+ ),
517
+ conversation_history=[
518
+ ConversationEntry(
519
+ role="lab_manager",
520
+ message="The current GPU plan is booked, so the schedule is too tight.",
521
+ round_number=1,
522
+ action_type="suggest_alternative",
523
+ )
524
+ ],
525
+ round_number=1,
526
+ )
527
+
528
+ action = build_baseline_scientist_action(obs)
529
+
530
+ assert action.action_type is ScientistActionType.REVISE_PROTOCOL
531
+ assert action.sample_size == 6
532
+ assert action.duration_days == 2
533
+ assert "latest Lab Manager concern" in action.rationale
534
+
535
+
536
+ def test_baseline_scientist_finishes_stub_episode_without_crashing() -> None:
537
+ from server.app import _StubEnv
538
+
539
+ env = _StubEnv()
540
+
541
+ first_observation = env.reset(
542
+ seed=14,
543
+ scenario="ml_benchmark",
544
+ difficulty="easy",
545
+ ).scientist
546
+ assert first_observation is not None
547
+
548
+ first_action = build_baseline_scientist_action(first_observation)
549
+ first_step = env.step(first_action)
550
+ assert first_step.done is False
551
+ assert first_step.observation is not None
552
+ assert first_step.observation.scientist is not None
553
+
554
+ second_action = build_baseline_scientist_action(first_step.observation.scientist)
555
+ second_step = env.step(second_action)
556
+
557
+ assert second_step.done is True
558
+ assert second_step.info.agreement_reached is True
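
The values asserted in `test_baseline_scientist_revises_when_latest_feedback_has_blocker` (sample size 12 to 6, duration 3 to 2) follow from the scope-reduction arithmetic in `_build_revision_action`. A minimal standalone sketch of that arithmetic, where `reduce_scope` is a hypothetical helper name rather than a function in the repository:

```python
from __future__ import annotations


def reduce_scope(sample_size: int, duration_days: int) -> tuple[int, int]:
    """Halve the sample size and drop one day, never going below 1."""
    reduced_sample_size = max(1, sample_size // 2) if sample_size else 1
    reduced_duration = max(1, duration_days - 1) if duration_days else 1
    return reduced_sample_size, reduced_duration


revised = reduce_scope(12, 3)
```

Both values are floored at 1, so repeated revisions stop shrinking the protocol once it reaches a single sample and a single day.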