ayushozha committed
Commit 8b157ab · 1 Parent(s): e50dca9

Recover env server client stack and deployment tracking
Dockerfile ADDED
@@ -0,0 +1,34 @@
+ # Root-level Dockerfile for Hugging Face Spaces deployment.
+ #
+ # HF Spaces with sdk:docker expects the Dockerfile at the repo root.
+ # This is identical to server/Dockerfile. Keep them in sync or remove
+ # server/Dockerfile once the team standardizes on this root copy.
+
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ # Install system deps
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     build-essential \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Install Python dependencies first for better layer caching
+ COPY server/requirements.txt ./server/requirements.txt
+ RUN pip install --no-cache-dir -r server/requirements.txt
+
+ # Copy package source
+ COPY replicalab/ ./replicalab/
+ COPY server/ ./server/
+ COPY pyproject.toml ./
+
+ # Install the replicalab package (non-editable, deps already present)
+ RUN pip install --no-cache-dir . --no-deps
+
+ # Run as a non-root user inside the container (HF Spaces requirement)
+ RUN useradd -m -u 1000 appuser && chown -R appuser /app
+ USER appuser
+
+ EXPOSE 7860
+
+ CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
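The image can be smoke-tested locally before wiring up the Space. A minimal sketch, assuming the container is already running via something like `docker build -t replicalab . && docker run -p 7860:7860 replicalab`, and assuming `server.app` exposes the `/health` route listed under API 01 in the task table (the tag name and route are illustrative, not confirmed project conventions):

```python
# Sketch, not project code: probe a locally running ReplicaLab container.
import urllib.error
import urllib.request


def health_status(base_url: str = "http://localhost:7860", timeout: float = 3.0) -> str:
    """Return 'ok' when /health answers 200, else a short diagnostic string."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return "ok" if resp.status == 200 else f"unexpected status {resp.status}"
    except (urllib.error.URLError, OSError) as exc:
        # Covers connection refused, DNS failure, and timeouts.
        return f"unreachable: {type(exc).__name__}"


if __name__ == "__main__":
    print(health_status())
```

Running this against a healthy container should print `ok`; against nothing it prints a short `unreachable: ...` diagnostic instead of raising.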
README.md CHANGED
@@ -1,3 +1,13 @@
+ ---
+ title: ReplicaLab
+ emoji: 🧪
+ colorFrom: blue
+ colorTo: green
+ sdk: docker
+ app_port: 7860
+ pinned: false
+ ---
+
  # ReplicaLab

  **A multi-agent constraint-aware planning environment built on [OpenEnv](https://github.com/openenv)**
@@ -8,11 +18,13 @@ ReplicaLab trains an agent to negotiate high-quality plans under real constraint

  ## Current Build Status

- - The repository is still in the foundation stage.
- - The Python package foundation is verified through editable install plus shared-model import checks.
+ - The repository is now past the foundation stage and has a working real environment plus deterministic judge pipeline.
+ - The Python package foundation is verified through editable install plus full test-suite checks.
  - Shared contracts currently live in `replicalab/models.py`, with the signed-off freeze in `docs/fnd08_frozen_json_contract.md`.
- - A stub-backed FastAPI and WebSocket server scaffold now exists in `server/app.py`, while real environment wiring is still in progress.
- - `openenv.yaml` now exists and passes local OpenEnv validation.
+ - `server/app.py` now serves the real `ReplicaLabEnv` by default, with the legacy stub retained only as a fallback safety path.
+ - `openenv.yaml` exists and passes local OpenEnv validation.
+ - Local Docker validation has been completed for the server image on port `7860`.
+ - Hugging Face Spaces Docker metadata is present in this README and the root `Dockerfile`; live hosted verification is still pending.
  - The frozen outer contract remains stable while the internal scenario engine moves toward a normalized scenario pack.
  - The planned Lab Manager path is hybrid: model-backed negotiation language plus deterministic feasibility grounding.
ReplicaLab_Comprehensive_Task_Division.md CHANGED
@@ -24,7 +24,7 @@ The goal is to let any team member pick up work immediately without confusion.

  **ReplicaLab** is an OpenEnv environment where a **Scientist agent** and a **Lab Manager agent** negotiate how to solve a constrained technical task under real world limits such as budget, tools, compute, schedule, stock, and staffing.

- The environment is used to **train the Scientist agent with reinforcement learning** so it learns to ask better questions, preserve objective quality, and produce more feasible plans under domain-specific constraints.

  The first domain focus is:

@@ -40,8 +40,8 @@ By judging time, the project should demonstrate:

  1. A working OpenEnv environment deployed on Hugging Face Spaces on port `7860`
  2. At least one full scenario family working end to end, with a target of three
- 3. A Scientist agent that can interact with the environment
- 4. A hybrid model-backed Lab Manager with deterministic feasibility grounding
  5. A deterministic judge and reward engine
  6. A Colab training notebook using Unsloth or HF TRL
  7. A reward curve showing improvement
@@ -58,31 +58,34 @@ By judging time, the project should demonstrate:

  1. OpenEnv environment implementation
  2. FastAPI and WebSocket serving
  3. Hugging Face Docker Space deployment
- 4. Scientist agent with structured JSON action output
- 5. Hybrid model-backed Lab Manager grounded by deterministic feasibility checks
  6. Judge rubric engine with deterministic scoring
  7. Three scenario families for MVP
     1. Mathematics reasoning and proof planning
     2. ML benchmark replication
     3. Finance or trading backtest planning
- 8. Reward logging
- 9. Replay logs
- 10. Colab RL notebook
- 11. Reward curve image
- 12. Thin React plus Vite frontend or OpenEnv `/web` fallback
- 13. README, demo video, submission package

  ## 3.2 Out of scope for the hackathon MVP

  1. Proving whether a real research paper is globally true or false
- 2. Parsing arbitrary real papers from the internet
  3. Real wet lab execution
  4. Live trading or production finance execution
  5. Real time collaboration features
  6. Training both Scientist and Lab Manager in self play
- 7. Complex third party enterprise integrations
- 8. Full multi-domain rollout unless time remains
- 9. Manager-led subagent orchestration unless the MVP is already stable

  ---
 
@@ -161,6 +164,56 @@ Rules for the normalized scenario layer:

  3. Difficulty and curriculum changes should mechanically alter constraints, resources, or conflicts rather than fork separate prompt logic.
  4. The deterministic scorer compares the final agreed plan against `hidden_reference_spec`; model-backed roles never own truth.

  ---
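Rule 4 above can be illustrated with a minimal sketch. This is not the project's scorer; the field names `required_steps` and `allowed_substitutions` are illustrative assumptions about the shape of `hidden_reference_spec`:

```python
# Sketch only: a deterministic, pure-function comparison of an agreed plan
# against a hidden reference spec (hypothetical field names).
def score_fidelity_sketch(plan_steps, hidden_reference_spec):
    required = hidden_reference_spec["required_steps"]
    substitutions = hidden_reference_spec.get("allowed_substitutions", {})
    plan = set(plan_steps)
    hits = 0
    for step in required:
        # A required step counts if present directly or via an allowed substitute.
        if step in plan or any(sub in plan for sub in substitutions.get(step, [])):
            hits += 1
    # Pure function of its inputs: same plan and spec always give the same score.
    return hits / len(required) if required else 1.0


spec = {"required_steps": ["baseline", "ablation"],
        "allowed_substitutions": {"ablation": ["sensitivity_check"]}}
print(score_fidelity_sketch(["baseline", "sensitivity_check"], spec))  # 1.0
```

Because no model output participates in the comparison, the score cannot drift between runs, which is what "model-backed roles never own truth" requires.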

  ## 5. Module and function ownership map
@@ -175,6 +228,9 @@ Rules for the normalized scenario layer:
  | `replicalab/agents/scientist_policy.py` | `build_scientist_prompt()`, `parse_scientist_output()` | Person B | trainable role |
  | `replicalab/agents/lab_manager_policy.py` | `generate_lab_manager_response()`, `check_feasibility()` | Person B with Person A | model-backed negotiation grounded by deterministic checker |
  | `replicalab/agents/judge_policy.py` | `explain_judgement()` optional only | Person A | explanation layer only |
  | `replicalab/scoring/rigor.py` | `score_rigor()` | Person A | deterministic |
  | `replicalab/scoring/feasibility.py` | `score_feasibility()` | Person A | deterministic |
  | `replicalab/scoring/fidelity.py` | `score_fidelity()` | Person A | deterministic |
@@ -297,12 +353,12 @@ Create a stable shared codebase, contracts, and development workflow so all work
  - Completed scope for `FND 11`: added `server/requirements.txt` with standalone runtime dependency pins and verified installation from that file
  - Completed scope for `FND 03`: imported the full React plus Vite frontend tree from Kush's branch onto `ayush`, including the app shell, pages, shared components, assets, and TypeScript config, and validated it with `npm --prefix frontend install` plus `npm --prefix frontend run build`
  - Completed scope for `FND 12`: imported `frontend/vite.config.ts` with local `/api` and `/ws` proxy support plus stable Vite build settings and validated the build on `ayush`
- - Partial backend scope imported from Max's PR: `server/app.py`, `server/Dockerfile`, and `docs/max/deployment.md` were normalized onto the current standards and validated locally against the stub env
  - Newly unblocked by `FND 08`: `MOD 01`, `MOD 02`, `MOD 03`, `MOD 12`, `SCN 01`
  - Newly unblocked by `FND 06`: `DOC 01`
  - Newly unblocked by `FND 03`: `FND 13`, `UI 01`
  - Remaining Epic E01 work still gated by follow-on dependencies: `FND 13`
- - Remaining completion items for the imported backend scaffold: real-env integration, Docker validation, and final deployment verification
  - Completed scope for `SCN 01` to `SCN 10`: added deterministic seed utilities, normalized scenario-pack models, math / ML / finance template builders, difficulty scaling, hidden reference specs, allowed substitutions, and seeded scenario tests
  - Completed scope for `SCN 11`: added three fixed golden scenarios for deterministic prompt and manual checks under `tests/fixtures/golden_scenarios.json`
  - Completed scope for `AGT 01`: added a domain-neutral Scientist system prompt builder that renders role instructions, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON output contract from normalized scenario data
@@ -453,14 +509,14 @@ As the Lab Manager, I want grounded negotiation plus deterministic feasibility c
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | AGT 01 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Draft domain-neutral system prompt for Scientist role from normalized scenario data | MOD 01, SCN 11 | 0.75h | prompt clearly explains role, mapped constraints, and JSON output contract | ✅ Completed | — |
  | AGT 02 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build observation to prompt formatting helper from normalized scenario-derived observations | AGT 01, MOD 03 | 0.75h | formatted prompt includes task info, history, and action schema consistently | ✅ Completed | — |
- | AGT 03 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 | 0.75h | malformed output triggers at least one controlled retry or explicit failure | Not started | — |
  | AGT 04 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build baseline heuristic Scientist for non trained smoke tests | AGT 02 | 1h | baseline can complete episodes without crashing | ✅ Completed | — |
  | AGT 05 | E04.2 | Person A and B | `replicalab/agents/lab_manager_policy.py` | Implement deterministic feasibility checker against normalized constraints, resources, schedule, and policy rules | SCN 07, MOD 05 | 1.25h | checker returns clear pass or fail per constraint dimension | ✅ Completed | Person B (Ayush) |
  | AGT 06 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | AGT 05, SCN 08 | 1h | lab manager can suggest at least one sensible revision when initial plan fails | ✅ Completed | — |
  | AGT 07 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Add model-backed response synthesis from feasibility results and suggested revisions | AGT 05 | 0.75h | output is readable, grounded in checker results, and maps cleanly to underlying checks | ✅ Completed | — |
- | AGT 08 | E04.1 | Person B | tests | Add prompt formatting and parse tests for Scientist policy | AGT 01 to AGT 04 | 0.75h | tests cover happy path and malformed output handling | Not started | — |
  | AGT 09 | E04.2 | Person A | tests | Add deterministic feasibility checker tests for Lab Manager grounding | AGT 05 to AGT 07 | 0.75h | same proposal plus same normalized scenario returns the same checker results every time | ⬜ Not started | — |
- | AGT 10 | E04.1 | Person B | `replicalab/prompts/` | Write prompt text files for all three roles: `scientist.txt`, `lab_manager.txt`, `judge.txt` | AGT 01, AGT 07, JDG 06 | 0.75h | prompt files exist, are loadable, and assemble correctly from normalized scenario data and agreed role behavior | ⬜ Not started | — |
  | AGT 11 | E04.1 | Person B | docs | Select and document base model for Scientist training with rationale for model size, license, and structured output capability | AGT 01 | 0.5h | decision is recorded and all team members know which model will be fine tuned | ✅ Completed | — |

  ---
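The AGT 05 acceptance criterion ("clear pass or fail per constraint dimension") can be sketched in a few lines. This is not the real `check_feasibility()`; the dimension names and plan/constraint keys are illustrative assumptions:

```python
# Sketch of the AGT 05 idea: deterministic per-dimension pass/fail results
# that the Lab Manager can ground its negotiation language in.
def check_feasibility_sketch(plan: dict, constraints: dict) -> dict:
    """Return {dimension: bool}; missing plan fields default to zero usage."""
    return {
        "budget": plan.get("cost", 0) <= constraints.get("budget", float("inf")),
        "schedule": plan.get("days", 0) <= constraints.get("max_days", float("inf")),
        "compute": plan.get("gpu_hours", 0) <= constraints.get("gpu_hours", float("inf")),
    }


result = check_feasibility_sketch(
    {"cost": 900, "days": 10, "gpu_hours": 64},
    {"budget": 1000, "max_days": 7, "gpu_hours": 100},
)
print(result)  # {'budget': True, 'schedule': False, 'compute': True}
```

Because the checker is a pure function of the proposal and the normalized constraints, rerunning it on the same inputs always returns the same verdicts, which is what AGT 09's determinism tests would pin down.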
@@ -478,20 +534,26 @@ As the training system, I need a stable reward so the model can improve.
  **US E05.2**
  As a judge, I need a readable score breakdown so I can understand why the environment rewarded or penalized the agent.

  ### Tasks

  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- | JDG 01 | E05.1 | Person A | `replicalab/scoring/rigor.py` | Implement rigor or objective-validity score for plan completeness, required checks, method quality, and justification | SCN 08 | 1.25h | score is between 0 and 1 and matches rubric examples | Not started | |
- | JDG 02 | E05.1 | Person A | `replicalab/scoring/feasibility.py` | Implement feasibility score for budget, resources, time, staffing, compute, and bookings | SCN 07, AGT 05 | 1.25h | score is between 0 and 1 and matches normalized constraint logic | Not started | |
- | JDG 03 | E05.1 | Person A | `replicalab/scoring/fidelity.py` | Implement fidelity score against hidden reference spec, required steps, and allowed substitutions | SCN 08 | 1h | score is between 0 and 1 and matches rubric examples | Not started | |
- | JDG 04 | E05.1 | Person A | `replicalab/scoring/rubric.py` | Implement total reward formula with bonuses and penalties | JDG 01 to JDG 03 | 0.75h | total reward formula matches agreed math and returns consistent output | Not started | |
- | JDG 05 | E05.2 | Person A | `replicalab/scoring/rubric.py` | Build reward breakdown object with component scores and penalties | JDG 04 | 0.5h | breakdown includes rigor, feasibility, fidelity, bonuses, and penalties | Not started | |
- | JDG 06 | E05.2 | Person A | `replicalab/agents/judge_policy.py` | Add optional plain English explanation function from reward breakdown | JDG 05 | 0.75h | explanation mirrors rubric and introduces no new hidden logic | ⬜ Not started | — |
- | JDG 07 | E05.1 | Person C | `replicalab/utils/logging.py` | Log reward breakdown to CSV or JSONL per episode | JDG 05, MOD 07 | 0.5h | reward file contains seed, scenario, score components, total reward, rounds, agreement | ⬜ Not started | — |
  | JDG 08 | E05.1 | Person A | tests | Add score determinism tests and edge case tests | JDG 01 to JDG 05 | 1h | perfect and broken protocols produce expected relative ordering | ⬜ Not started | — |
  | JDG 09 | E05.2 | Person D | UI mocks | Create mock score cards and language for frontend | JDG 05 | 0.5h | UI can display score breakdown from mock data | ⬜ Not started | — |
- | JDG 10 | E05.1 | Person B | notebook support | Expose component metrics for training plots | JDG 05, JDG 07 | 0.5h | notebook can read average rigor, feasibility, fidelity over time | ⬜ Not started | — |
  | JDG 11 | E05.2 | Person A | `replicalab/scoring/rubric.py` and `replicalab/agents/judge_policy.py` | Add structured final audit payload with `judge_notes`, `verdict`, and top failure reasons derived from the rubric | JDG 05, JDG 06 | 0.75h | final judgement output is deterministic, human readable, and consumable by env, API, logs, and UI | ⬜ Not started | — |

  ---
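The shape of the JDG 04/05 deliverables can be sketched as a small breakdown object. The agreed math lives in `replicalab/scoring/rubric.py`; the equal weighting and the clamp below are illustrative assumptions, not the team's formula:

```python
# Sketch only: a reward breakdown carrying the components JDG 05 requires,
# with a hypothetical total formula (equal weights, clamped to [0, 1]).
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardBreakdown:
    rigor: float        # each component expected in [0, 1]
    feasibility: float
    fidelity: float
    bonuses: float
    penalties: float

    def total(self) -> float:
        base = (self.rigor + self.feasibility + self.fidelity) / 3.0
        # Clamp so bonuses and penalties cannot push reward outside [0, 1].
        return max(0.0, min(1.0, base + self.bonuses - self.penalties))


b = RewardBreakdown(rigor=0.9, feasibility=0.6, fidelity=0.9, bonuses=0.05, penalties=0.1)
print(round(b.total(), 3))  # 0.75
```

A frozen dataclass keeps the breakdown immutable once computed, so the same object can safely feed the env, API responses, logs, and UI without any consumer mutating the score.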
@@ -516,14 +578,14 @@ As a judge, I want deterministic replay and cleanup.

  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- | ENV 01 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Create `ReplicaLabEnv` class skeleton | MOD 04, SCN 09 | 0.5h | environment class imports and instantiates without runtime errors | Not started | |
- | ENV 02 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Implement `reset(seed, template, difficulty)` | ENV 01, SCN 09 | 1h | reset returns initial observations and a fresh episode state | Not started | |
- | ENV 03 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Scientist turn application | ENV 02, AGT 05 | 1h | valid Scientist action updates state and history correctly | Not started | |
- | ENV 04 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Lab Manager response step | ENV 03, AGT 07 | 1h | lab manager response is appended and returned in the next observation | Not started | |
- | ENV 05 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement accept, timeout, and max round logic | ENV 03, ENV 04 | 0.75h | episode terminates correctly on agreement or round limit | Not started | |
- | ENV 06 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Integrate reward computation at finalization and optional intermediate score previews | ENV 05, JDG 05 | 1h | final step returns total reward and breakdown info | Not started | |
- | ENV 07 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `state()` | ENV 02 to ENV 06 | 0.5h | current environment state can be retrieved for debugging and replay | Not started | |
- | ENV 08 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `close()` cleanup | ENV 01 | 0.25h | close frees any transient resources and does not throw | Not started | |
  | ENV 09 | E06.3 | Person C | `replicalab/utils/logging.py` | Write episode logs on completion | ENV 06, JDG 07 | 0.5h | completed episodes generate replayable logs automatically | ⬜ Not started | — |
  | ENV 10 | E06.1 to E06.3 | Person A | tests | Add reset, step, invalid action, timeout, and deterministic replay tests | ENV 02 to ENV 09 | 1.25h | tests pass for seeded reset, valid step, invalid step, and replay consistency | ⬜ Not started | — |
  | ENV 11 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Attach judge audit payload to final `StepResult`, terminal observations, and replay state | ENV 06, JDG 11 | 0.5h | completed episodes expose audit notes alongside reward breakdown in a stable schema | ⬜ Not started | — |
@@ -548,23 +610,23 @@ As the team, we want one click reproducible deployment to HF Spaces.
  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | API 01 | E07.1 | Person C | `server/app.py` | Create FastAPI app shell and health endpoint | ENV 01 | 0.5h | `GET /health` returns 200 with simple payload | 🟡 Partial | — |
- | API 02 | E07.1 | Person C | `server/app.py` | Add `POST /reset` endpoint | ENV 02 | 0.75h | reset endpoint starts a new episode and returns initial observation | 🟡 Partial | |
- | API 03 | E07.1 | Person C | `server/app.py` | Add `POST /step` endpoint | ENV 06 | 0.75h | step endpoint accepts valid action and returns step result | 🟡 Partial | |
- | API 04 | E07.1 | Person C | `server/app.py` | Add `GET /scenarios` endpoint | SCN 03 to SCN 05 | 0.5h | endpoint lists available scenario families and difficulties | 🟡 Partial | |
  | API 05 | E07.1 | Person C | `server/app.py` | Add `GET /replay/{episode_id}` endpoint | ENV 09 | 0.75h | endpoint returns completed log for valid episode id | ⬜ Not started | — |
- | API 06 | E07.1 | Person C | `server/app.py` | Add WebSocket session handler | ENV 06 | 1.25h | each connection gets isolated environment state and can reset plus step | 🟡 Partial | |
- | API 07 | E07.1 | Person C | `server/app.py` | Add idle timeout and graceful disconnect cleanup | API 06, ENV 08 | 0.75h | stale connections close cleanly and environment closes without leak | 🟡 Partial | |
- | API 08 | E07.2 | Person C | `server/Dockerfile` | Build Dockerfile with Python app startup on port 7860 | API 01 to API 07 | 0.75h | local Docker run serves app on port 7860 | 🟡 Partial | |
- | API 09 | E07.2 | Person C | HF config files | Add Hugging Face Space metadata and deploy instructions | API 08 | 0.5h | Space config is valid for Docker app deployment | Not started | |
  | API 10 | E07.2 | Person C | deployment docs | Deploy live Space and verify health, reset, and step | API 09 | 1h | live Space responds successfully to health and one end to end episode | ⬜ Not started | — |
  | API 11 | E07.1 | Person C | tests | Add server endpoint tests and WebSocket smoke test | API 01 to API 07 | 1h | local server tests pass for health, reset, step, invalid payload, and ws connect | ⬜ Not started | — |
  | API 12 | E07.2 | Person D | docs | Capture deployment screenshots and public link for README | API 10 | 0.25h | README ready screenshots and live link are available | ⬜ Not started | — |
- | API 13 | E07.1 | Person C | `server/app.py` | Add CORS middleware configuration for frontend origins in dev and production | API 01 | 0.25h | frontend on localhost:5173 and HF Space origin can reach the API without CORS errors | 🟡 Partial | |
  | API 14 | E07.1 | Person C | `server/app.py` | Add REST session management so each user gets isolated environment state | API 02, API 03 | 0.75h | two concurrent REST users do not share or corrupt each other's episode state | 🟡 Partial | — |
- | API 15 | E07.2 | Person C | HF Space repo | Create HF Space README.md with YAML frontmatter specifying `sdk: docker`, `app_port: 7860`, title, and emoji | API 08 | 0.25h | HF Space config is valid and Space launches correctly from the metadata | Not started | |
  | API 16 | E07.2 | Person C | `server/Dockerfile` | Configure Docker to build frontend and serve static assets from FastAPI in a single container | API 08, UI 10 | 0.75h | single Docker container serves both API and frontend on port 7860 | ⬜ Not started | — |
- | API 17 | E07.2 | Person C | deployment docs | Document secrets and API key management for Scientist LLM access in deployment and notebook | API 09 | 0.5h | team knows how to set API keys in HF Space secrets, local env, and Colab secrets | ⬜ Not started | — |
- | API 18 | E07.1 | Person C | `server/app.py` | Include judge audit payload in REST, replay, and WebSocket responses for terminal episodes | API 03, API 05, API 06, ENV 11 | 0.5h | clients receive `judge_notes` and verdict fields without separate log file access | ⬜ Not started | — |
  | API 19 | E07.2 | Person C | `openenv.yaml` and deployment docs | Expose and verify OpenEnv built in `/web` fallback route locally and on HF Space | FND 09, API 08, API 10 | 0.5h | `/web` is documented, reachable, and able to run a seeded episode when the custom UI is unavailable | ⬜ Not started | — |

  ---
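The API 14 requirement (two concurrent REST users must not share episode state) comes down to keeping one environment instance per session id. A minimal sketch, not `server/app.py` itself; `FakeEnv` stands in for `ReplicaLabEnv` and the store shape is an assumption:

```python
# Sketch: per-session environment isolation for REST consumers.
import uuid


class FakeEnv:
    """Stand-in for ReplicaLabEnv with a tiny bit of per-episode state."""
    def __init__(self):
        self.round = 0

    def step(self):
        self.round += 1
        return self.round


class SessionStore:
    def __init__(self, env_factory):
        self._envs = {}
        self._factory = env_factory

    def create(self) -> str:
        sid = uuid.uuid4().hex
        self._envs[sid] = self._factory()   # fresh env per session
        return sid

    def get(self, sid: str):
        return self._envs[sid]

    def close(self, sid: str) -> None:
        self._envs.pop(sid, None)           # drop state on disconnect or idle timeout


store = SessionStore(FakeEnv)
a, b = store.create(), store.create()
store.get(a).step()
print(store.get(a).round, store.get(b).round)  # 1 0
```

Stepping session `a` leaves session `b` untouched, which is exactly the acceptance criterion; the same store can back the WebSocket handler's per-connection isolation (API 06).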
@@ -586,21 +648,21 @@ As the team, we want a repeatable evaluation workflow for before versus after co

  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- | TRN 01 | E08.1 | Person B | `notebooks/train_colab.ipynb` | Create notebook skeleton with setup, connect, train, and plot sections | API 10 | 0.5h | notebook has clear runnable sections in the right order | ⬜ Not started | — |
  | TRN 02 | E08.1 | Person B | notebook | Add package install and model setup cell for Unsloth or HF TRL | TRN 01 | 0.75h | notebook installs dependencies without manual edits beyond secrets | ⬜ Not started | — |
- | TRN 03 | E08.1 | Person B | notebook or `client.py` | Implement environment client wrapper for reset plus step over WebSocket or REST | API 06 | 1h | notebook can start and finish an episode against local or hosted env | ⬜ Not started | — |
- | TRN 04 | E08.1 | Person B | notebook | Implement rollout collection loop for Scientist episodes | TRN 03, AGT 01 | 1h | loop collects trajectories, rewards, and done signals | ⬜ Not started | — |
- | TRN 05 | E08.1 | Person B | notebook | Connect rollouts to GRPO or equivalent trainer | TRN 04 | 1.25h | at least one short training run completes without runtime errors | ⬜ Not started | — |
- | TRN 06 | E08.1 | Person B | notebook | Log episode reward, rigor, feasibility, fidelity, and rounds used | JDG 10, TRN 04 | 0.75h | notebook stores metrics frame across training episodes | ⬜ Not started | — |
  | TRN 07 | E08.2 | Person B | notebook | Plot reward curve and component curves with matplotlib | TRN 06 | 0.5h | plotted image shows visible metrics and can be saved to file | ⬜ Not started | — |
- | TRN 08 | E08.2 | Person B | notebook | Add before versus after evaluation on fixed seeds | SCN 11, TRN 05 | 1h | notebook compares baseline and trained policy on the same scenarios | ⬜ Not started | — |
  | TRN 09 | E08.2 | Person B | `replicalab/agents/scientist_policy.py` | Add policy loading path for trained adapter or checkpoint | TRN 05 | 0.5h | evaluation can switch between baseline and trained model cleanly | ⬜ Not started | — |
  | TRN 10 | E08.2 | Person B | docs | Export plot image and sample logs to `outputs/plots` | TRN 07 | 0.25h | plots are saved and versioned for README use | ⬜ Not started | — |
  | TRN 11 | E08.1 | Person C | infra notes | Document environment URL, secrets, and connection troubleshooting | TRN 03 | 0.25h | any team member can run the notebook using the notes | ⬜ Not started | — |
  | TRN 12 | E08.2 | Person D | storytelling | Convert evaluation results into two or three clear bullet insights for judges | TRN 08 | 0.5h | README and demo can state what improved in plain English | ⬜ Not started | — |
- | TRN 13 | E08.1 | Person B | `replicalab/client.py` | Create reusable environment client module with `connect()`, `reset()`, `step()`, `close()` over REST and WebSocket | API 06 | 1h | client module can be imported by notebook and other consumers without duplicating connection logic | Not started | |
  | TRN 14 | E08.1 | Person B | notebook or docs | Select and document base model for Scientist fine tuning with rationale for size, license, and structured output capability | TRN 01 | 0.5h | base model choice is documented and all team members know which model is being trained | ⬜ Not started | — |
- | TRN 15 | E08.2 | Person B | notebook | Add agreement rate and invalid action rate aggregation to evaluation outputs and before versus after comparison | TRN 06, TRN 08, OBS 09 | 0.5h | notebook reports reward, rounds, agreement rate, and invalid action rate for baseline and trained runs | ⬜ Not started | — |

  ---
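The TRN 04 rollout loop ("collects trajectories, rewards, and done signals") can be sketched against a stand-in environment. The real notebook would drive the served env through the TRN 13 client wrapper; `fake_env_episode` and its shapes are illustrative:

```python
# Sketch of a rollout collection loop over seeded episodes.
import random


def fake_env_episode(seed: int, max_rounds: int = 4):
    """Yield (observation, reward, done) triples like reset/step would."""
    rng = random.Random(seed)
    for r in range(1, max_rounds + 1):
        done = r == max_rounds
        # Reward arrives only at episode finalization, as in the judge design.
        yield {"round": r}, rng.random() if done else 0.0, done


def collect_rollouts(seeds):
    trajectories = []
    for seed in seeds:
        steps = [(obs, reward, done) for obs, reward, done in fake_env_episode(seed)]
        trajectories.append({
            "seed": seed,
            "steps": steps,
            "episode_reward": sum(r for _, r, _ in steps),
        })
    return trajectories


rollouts = collect_rollouts([0, 1, 2])
print(len(rollouts), rollouts[0]["steps"][-1][2])  # 3 True
```

Because every episode is driven by an explicit seed, the same seed list reproduces the same trajectory batch, which is what the before-versus-after evaluation on fixed seeds (TRN 08) relies on.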
 
@@ -661,7 +723,7 @@ As a judge, I want the same seeded scenario to be replayable.
  | OBS 03 | E10.1 | Person C | replay utilities | Add episode id generation and file naming conventions | OBS 01 | 0.25h | logs never overwrite and are easy to locate | ⬜ Not started | — |
  | OBS 04 | E10.2 | Person A | tests | Add deterministic replay test using seed and action sequence | ENV 10 | 0.75h | replay of same seed and actions matches prior state sequence | ⬜ Not started | — |
  | OBS 05 | E10.2 | Person D | UI | Surface episode id and replay link in UI | API 05, UI 08 | 0.5h | user can easily capture or revisit a past episode | ⬜ Not started | — |
- | OBS 06 | E10.1 | Person B | notebook | Log training run metadata including model, seed, scenario set, steps | TRN 06 | 0.5h | notebook exports metadata with each run for reproducibility | ⬜ Not started | — |
  | OBS 07 | E10.1 | Person C | scripts | Add simple local script to run one episode and dump logs | ENV 06, OBS 01 | 0.5h | one command produces a complete local sample log | ⬜ Not started | — |
  | OBS 08 | E10.2 | Person D | storytelling | Create static replay screenshots or gifs for README and video | UI 08 | 0.5h | at least two crisp visual assets are ready for docs and demo | ⬜ Not started | — |
  | OBS 09 | E10.1 | Person C | `replicalab/utils/logging.py` | Extend episode summary schema with `judge_notes`, `agreement`, `invalid_action_count`, and `invalid_action_rate` for replay and evaluation consumers | OBS 01, JDG 11, ENV 11 | 0.5h | every completed episode log contains the audit payload plus demo and evaluation metrics needed by notebook, UI, and README | ⬜ Not started | — |
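The OBS 04 property ("replay of same seed and actions matches prior state sequence") holds whenever all randomness flows from a single seeded generator. A minimal sketch with a hypothetical `toy_transition` standing in for the real environment step:

```python
# Sketch: deterministic replay from a seed plus an action sequence.
import random


def toy_transition(state: int, action: int, rng: random.Random) -> int:
    # Illustrative stand-in for the environment's step logic.
    return state + action + rng.randint(0, 9)


def replay(seed: int, actions: list) -> list:
    rng = random.Random(seed)        # all randomness flows from the seed
    state, trace = 0, []
    for action in actions:
        state = toy_transition(state, action, rng)
        trace.append(state)
    return trace


print(replay(42, [1, 2, 3]) == replay(42, [1, 2, 3]))  # True
```

A replay test then only needs to store the seed and the action sequence in the episode log (OBS 01/OBS 03) and assert the regenerated state trace matches the recorded one.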
@@ -685,15 +747,15 @@ As a judge, I want the system to work reliably when clicked live.

  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- | TST 01 | E11.1 | Person A | `tests/test_env.py` | Add reset returns valid observations test | ENV 02 | 0.5h | test confirms both roles receive valid structured observations | Not started | |
- | TST 02 | E11.1 | Person A | `tests/test_env.py` | Add valid action step test | ENV 03 to ENV 06 | 0.5h | valid action advances round and returns correct shape | Not started | |
- | TST 03 | E11.1 | Person A | `tests/test_env.py` | Add invalid action handling test | MOD 05, ENV 03 | 0.5h | invalid action yields structured error and environment survives | Not started | |
- | TST 04 | E11.1 | Person A | `tests/test_reward.py` | Add perfect protocol high reward test | JDG 04 | 0.5h | perfect protocol scores higher than baseline and broken protocol | Not started | |
- | TST 05 | E11.1 | Person A | `tests/test_reward.py` | Add zero dimension or penalty behavior test | JDG 04 | 0.5h | zero feasibility or timeout lowers reward as expected | Not started | |
  | TST 06 | E11.1 | Person C | `tests/test_server.py` | Add health plus reset plus step endpoint tests | API 01 to API 03 | 0.75h | API tests pass locally | ⬜ Not started | — |
- | TST 07 | E11.1 | Person C | `tests/test_server.py` | Add WebSocket connection and invalid payload tests | API 06 | 0.75h | WebSocket errors are graceful and session stays isolated | Not started | |
  | TST 08 | E11.2 | Person D | manual checklist | Create demo smoke checklist for local and hosted builds | UI 12, API 10 | 0.5h | team can verify full demo in under five minutes | ⬜ Not started | — |
- | TST 09 | E11.2 | Person B | notebook checklist | Create notebook smoke test for fresh runtime | TRN 12 | 0.5h | training notebook runs from top with minimal edits | ⬜ Not started | — |
  | TST 10 | E11.2 | all | full run | Execute one integrated test pass before freeze | all prior TST tasks | 1h | environment, UI, Space, and notebook all pass their smoke tests the same day | ⬜ Not started | — |
  | TST 11 | E11.1 | Person C | `tests/test_server.py` and `tests/test_env.py` | Add contract tests for judge audit payloads and invalid action metrics in terminal responses and replay logs | API 18, OBS 09 | 0.75h | tests confirm terminal payloads and replay files expose audit notes, agreement, and invalid action metrics | ⬜ Not started | — |
  | TST 12 | E11.2 | Person D | manual checklist | Add fallback `/web` smoke step plus replay slider and before versus after toggle checks to demo checklist | API 19, UI 14, UI 15 | 0.5h | checklist verifies custom UI path and fallback UI path are both demo ready | ⬜ Not started | — |
@@ -877,7 +939,7 @@ The environment client must expose:
  3. reward
  4. done
  5. final info including component scores
- 6. API key or secret configuration for LLM access in both hosted and notebook environments

  ### Scenario to judge contract
  Every scenario must provide:
 
  **ReplicaLab** is an OpenEnv environment where a **Scientist agent** and a **Lab Manager agent** negotiate how to solve a constrained technical task under real world limits such as budget, tools, compute, schedule, stock, and staffing.

+ The environment is used to **train the Scientist agent with reinforcement learning** so it learns to ask better questions, preserve objective quality, use bounded evidence tools correctly, and produce more feasible plans under domain-specific constraints.

  The first domain focus is:

  1. A working OpenEnv environment deployed on Hugging Face Spaces on port `7860`
  2. At least one full scenario family working end to end, with a target of three
+ 3. A Scientist agent that can interact with the environment through structured actions and bounded evidence tools
+ 4. A hybrid model-backed Lab Manager with deterministic feasibility grounding and bounded validation tools
  5. A deterministic judge and reward engine
  6. A Colab training notebook using Unsloth or HF TRL
  7. A reward curve showing improvement
 
  1. OpenEnv environment implementation
  2. FastAPI and WebSocket serving
  3. Hugging Face Docker Space deployment
+ 4. Scientist agent with structured JSON action output plus bounded search, code-check, and image-inspection capability
+ 5. Hybrid model-backed Lab Manager grounded by deterministic feasibility checks plus bounded validation tools
  6. Judge rubric engine with deterministic scoring
  7. Three scenario families for MVP
  1. Mathematics reasoning and proof planning
  2. ML benchmark replication
  3. Finance or trading backtest planning
+ 8. Frozen evidence packs for deterministic training plus limited live validation during demo or eval
+ 9. Reward logging
+ 10. Replay logs
+ 11. Colab RL notebook
+ 12. Reward curve image
+ 13. Thin React plus Vite frontend or OpenEnv `/web` fallback
+ 14. README, demo video, submission package
 
  ## 3.2 Out of scope for the hackathon MVP

  1. Proving whether a real research paper is globally true or false
+ 2. Unrestricted parsing of arbitrary live internet content inside the training loop
  3. Real wet lab execution
  4. Live trading or production finance execution
  5. Real time collaboration features
  6. Training both Scientist and Lab Manager in self play
+ 7. Open-ended autonomous coding outside a bounded verification or analysis sandbox
+ 8. Image generation or audio capabilities in the agent policy loop
+ 9. Complex third party enterprise integrations
+ 10. Full multi-domain rollout unless time remains
+ 11. Manager-led subagent orchestration unless the MVP is already stable
 
  ---

  3. Difficulty and curriculum changes should mechanically alter constraints, resources, or conflicts rather than fork separate prompt logic.
  4. The deterministic scorer compares the final agreed plan against `hidden_reference_spec`; model-backed roles never own truth.

+ For the bounded-tool MVP, pending scenario and environment work will extend the
+ existing normalized scenario pack with additive evidence fields. This is an
+ extension below the frozen outer contract, not a reopening of `FND 08`,
+ `MOD 01`, `MOD 02`, or `MOD 03`.
+
+ Tool-capable scenario extensions:
+
+ 1. `evidence_pack`
+ 2. `artifact_refs`
+ 3. `allowed_tools`
+ 4. `tool_budget`
+ 5. `validation_policy`
+
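The five additive extension fields above can be sketched as a small container. The field names come from the list; the types, defaults, and the `EvidenceExtension` name itself are illustrative assumptions, not part of the frozen contract:

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceExtension:
    """Hypothetical sketch of the additive scenario fields; types are assumed."""
    evidence_pack: dict = field(default_factory=dict)   # frozen facts keyed by evidence id
    artifact_refs: list = field(default_factory=list)   # e.g. paths to tables, figures, screenshots
    allowed_tools: list = field(default_factory=list)   # subset of the bounded tool names
    tool_budget: dict = field(default_factory=dict)     # e.g. {"search_evidence": 3}
    validation_policy: str = "frozen_only"              # assumed flag: frozen packs during training

# Example: a scenario that only permits bounded evidence search
ext = EvidenceExtension(allowed_tools=["search_evidence"],
                        tool_budget={"search_evidence": 3})
```

Because every field has a default, the extension stays additive: existing scenario packs that omit these fields still parse unchanged.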
+ ## 4.3 Bounded tool capability policy
+
+ The richer-capability MVP keeps the final outward action contract stable while
+ adding bounded tools below it.
+
+ ### Scientist allowed capabilities
+
+ 1. `search_evidence`
+ - retrieve supporting facts, benchmark rules, paper details, or official references
+ - not a reward source
+ 2. `run_code_check`
+ - bounded code or config analysis, metric checks, value generation, runtime or cost estimation
+ 3. `inspect_image`
+ - read tables, plots, figures, screenshots, and charts for evidence extraction
+
+ ### Lab Manager allowed capabilities
+
+ 1. `search_resources`
+ - retrieve resource, policy, benchmark, or documentation constraints
+ 2. `run_code_check`
+ - validate cost, runtime, config, reproducibility, or execution assumptions
+ 3. `inspect_image`
+ - inspect figures, charts, and screenshots relevant to feasibility or policy review
+
+ ### Judge capability rules
+
+ 1. The judge reward remains deterministic and must not depend on live web search.
+ 2. Tool traces and evidence references may inform deterministic penalties, bonuses, or audit text.
+ 3. The judge may use bounded evidence verification for demo or audit text, but never as the training reward source.
+
+ ### Training and demo rules
+
+ 1. Training uses frozen evidence packs and deterministic tool traces whenever possible.
+ 2. Live web search is limited to demo-time or eval-time validation, not the core training reward loop.
+ 3. Image generation and audio are excluded from the policy loop for the hackathon MVP.
+ 4. Coding capability must stay sandboxed and task-scoped rather than open-ended.
+
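The allow-list plus budget rules above amount to a small gate in front of every tool call. A minimal sketch, assuming a per-episode budget dict shaped like `{"search_evidence": 3}`; the class and exception names are illustrative, not from the repo:

```python
class ToolBudgetExceeded(Exception):
    """Raised when a tool is not allowed or its per-episode budget is spent."""

class BoundedToolGate:
    def __init__(self, allowed_tools, tool_budget):
        self.allowed = set(allowed_tools)
        self.remaining = dict(tool_budget)  # copy so the scenario pack stays untouched

    def call(self, tool_name, tool_fn, *args, **kwargs):
        # Reject tools outside the scenario's allow-list outright.
        if tool_name not in self.allowed:
            raise ToolBudgetExceeded(f"{tool_name} is not allowed in this scenario")
        # Enforce the bounded budget before the underlying tool runs.
        if self.remaining.get(tool_name, 0) <= 0:
            raise ToolBudgetExceeded(f"{tool_name} budget exhausted")
        self.remaining[tool_name] -= 1
        return tool_fn(*args, **kwargs)

# Example: one budgeted search against a frozen evidence pack
gate = BoundedToolGate(["search_evidence"], {"search_evidence": 1})
result = gate.call("search_evidence", lambda q: f"frozen result for {q}", "benchmark rules")
```

During training the `tool_fn` would read from the frozen evidence pack, so the same seed plus the same actions reproduces the same traces.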
  ---
  ## 5. Module and function ownership map
 
228
  | `replicalab/agents/scientist_policy.py` | `build_scientist_prompt()`, `parse_scientist_output()` | Person B | trainable role |
229
  | `replicalab/agents/lab_manager_policy.py` | `generate_lab_manager_response()`, `check_feasibility()` | Person B with Person A | model-backed negotiation grounded by deterministic checker |
230
  | `replicalab/agents/judge_policy.py` | `explain_judgement()` optional only | Person A | explanation layer only |
231
+ | `replicalab/tools/search.py` | `search_evidence()`, `search_resources()` | Person B with Person C | bounded retrieval and validation only |
232
+ | `replicalab/tools/code_tools.py` | `run_code_check()` | Person B | bounded code analysis, config checks, and derived-value generation |
233
+ | `replicalab/tools/image_tools.py` | `inspect_image()` | Person B with Person D | bounded table, chart, figure, and screenshot inspection |
234
  | `replicalab/scoring/rigor.py` | `score_rigor()` | Person A | deterministic |
235
  | `replicalab/scoring/feasibility.py` | `score_feasibility()` | Person A | deterministic |
236
  | `replicalab/scoring/fidelity.py` | `score_fidelity()` | Person A | deterministic |
 
353
  - Completed scope for `FND 11`: added `server/requirements.txt` with standalone runtime dependency pins and verified installation from that file
354
  - Completed scope for `FND 03`: imported the full React plus Vite frontend tree from Kush's branch onto `ayush`, including the app shell, pages, shared components, assets, and TypeScript config, and validated it with `npm --prefix frontend install` plus `npm --prefix frontend run build`
355
  - Completed scope for `FND 12`: imported `frontend/vite.config.ts` with local `/api` and `/ws` proxy support plus stable Vite build settings and validated the build on `ayush`
356
+ - Backend and deployment scope imported from Max's PR has now been normalized onto the current standards, validated against the real env, Docker-verified locally, and extended with HF Spaces metadata plus deployment instructions
357
  - Newly unblocked by `FND 08`: `MOD 01`, `MOD 02`, `MOD 03`, `MOD 12`, `SCN 01`
358
  - Newly unblocked by `FND 06`: `DOC 01`
359
  - Newly unblocked by `FND 03`: `FND 13`, `UI 01`
360
  - Remaining Epic E01 work still gated by follow-on dependencies: `FND 13`
361
+ - Remaining completion items for the backend and deployment path: live HF Space bring-up (`API 10`), secrets documentation (`API 17`), replay persistence, and the remaining partial API polish tasks
362
  - Completed scope for `SCN 01` to `SCN 10`: added deterministic seed utilities, normalized scenario-pack models, math / ML / finance template builders, difficulty scaling, hidden reference specs, allowed substitutions, and seeded scenario tests
363
  - Completed scope for `SCN 11`: added three fixed golden scenarios for deterministic prompt and manual checks under `tests/fixtures/golden_scenarios.json`
364
  - Completed scope for `AGT 01`: added a domain-neutral Scientist system prompt builder that renders role instructions, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON output contract from normalized scenario data
 
509
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
510
  | AGT 01 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Draft domain-neutral system prompt for Scientist role from normalized scenario data | MOD 01, SCN 11 | 0.75h | prompt clearly explains role, mapped constraints, and JSON output contract | ✅ Completed | — |
511
  | AGT 02 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build observation to prompt formatting helper from normalized scenario-derived observations | AGT 01, MOD 03 | 0.75h | formatted prompt includes task info, history, and action schema consistently | ✅ Completed | — |
512
+ | AGT 03 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 | 0.75h | malformed output triggers at least one controlled retry or explicit failure | Completed | — |
513
  | AGT 04 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build baseline heuristic Scientist for non trained smoke tests | AGT 02 | 1h | baseline can complete episodes without crashing | ✅ Completed | — |
514
  | AGT 05 | E04.2 | Person A and B | `replicalab/agents/lab_manager_policy.py` | Implement deterministic feasibility checker against normalized constraints, resources, schedule, and policy rules | SCN 07, MOD 05 | 1.25h | checker returns clear pass or fail per constraint dimension | ✅ Completed | Person B (Ayush) |
515
  | AGT 06 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | AGT 05, SCN 08 | 1h | lab manager can suggest at least one sensible revision when initial plan fails | ✅ Completed | — |
516
  | AGT 07 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Add model-backed response synthesis from feasibility results and suggested revisions | AGT 05 | 0.75h | output is readable, grounded in checker results, and maps cleanly to underlying checks | ✅ Completed | — |
517
+ | AGT 08 | E04.1 | Person B | tests | Add prompt formatting, parse, and bounded-tool policy tests for Scientist policy | AGT 01 to AGT 04 | 0.75h | tests cover happy path, malformed output handling, and stable tool-policy reminders | ✅ Completed | — |
518
  | AGT 09 | E04.2 | Person A | tests | Add deterministic feasibility checker tests for Lab Manager grounding | AGT 05 to AGT 07 | 0.75h | same proposal plus same normalized scenario returns the same checker results every time | ⬜ Not started | — |
519
+ | AGT 10 | E04.1 | Person B | `replicalab/prompts/` | Write prompt text files for all three roles: `scientist.txt`, `lab_manager.txt`, `judge.txt`, including bounded rules for search, code checks, and image inspection | AGT 01, AGT 07, JDG 06 | 0.75h | prompt files exist, are loadable, encode bounded tool rules clearly, and assemble correctly from normalized scenario data and agreed role behavior | ⬜ Not started | — |
520
  | AGT 11 | E04.1 | Person B | docs | Select and document base model for Scientist training with rationale for model size, license, and structured output capability | AGT 01 | 0.5h | decision is recorded and all team members know which model will be fine tuned | ✅ Completed | — |
521
 
522
  ---
 
534
  **US E05.2**
535
  As a judge, I need a readable score breakdown so I can understand why the environment rewarded or penalized the agent.
536
 
537
+ ### Executor notes
538
+
539
+ - `JDG 01` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
540
+ - `JDG 02` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
541
+ - `JDG 03` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
542
+
543
  ### Tasks
544
 
545
  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
546
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
547
+ | JDG 01 | E05.1 | Person A | `replicalab/scoring/rigor.py` | Implement rigor or objective-validity score for plan completeness, required checks, method quality, justification, and correct bounded evidence use when present | SCN 08 | 1.25h | score is between 0 and 1, matches rubric examples, and rewards correct evidence-backed planning without depending on live web results | Completed | Person B (Ayush) |
548
+ | JDG 02 | E05.1 | Person A | `replicalab/scoring/feasibility.py` | Implement feasibility score for budget, resources, time, staffing, compute, bookings, and deterministic tool-backed validation results | SCN 07, AGT 05 | 1.25h | score is between 0 and 1 and matches normalized constraint logic plus deterministic tool outcomes | Completed | Person B (Ayush) |
549
+ | JDG 03 | E05.1 | Person A | `replicalab/scoring/fidelity.py` | Implement fidelity score against hidden reference spec, required steps, allowed substitutions, and supported evidence claims when present | SCN 08 | 1h | score is between 0 and 1 and matches rubric examples for plan and evidence alignment | Completed | Person B (Ayush) |
550
+ | JDG 04 | E05.1 | Person A | `replicalab/scoring/rubric.py` | Implement total reward formula with bonuses and penalties, including deterministic penalties for invalid tool use or unsupported evidence claims | JDG 01 to JDG 03 | 0.75h | total reward formula matches agreed math and returns consistent output for plan quality and bounded tool behavior | Completed | Person B (Ayush) |
551
+ | JDG 05 | E05.2 | Person A | `replicalab/scoring/rubric.py` | Build reward breakdown object with component scores, penalties, and tool-use diagnostics | JDG 04 | 0.5h | breakdown includes rigor, feasibility, fidelity, bonuses, penalties, and bounded tool diagnostics | Completed | Person B (Ayush) |
552
+ | JDG 06 | E05.2 | Person A | `replicalab/agents/judge_policy.py` | Add optional plain English explanation function from reward breakdown | JDG 05 | 0.75h | explanation mirrors rubric, may reference bounded evidence or tool outcomes, and introduces no new hidden logic | ⬜ Not started | — |
553
+ | JDG 07 | E05.1 | Person C | `replicalab/utils/logging.py` | Log reward breakdown to CSV or JSONL per episode | JDG 05, MOD 07 | 0.5h | reward file contains seed, scenario, score components, total reward, rounds, agreement, and bounded tool metrics | ⬜ Not started | — |
554
  | JDG 08 | E05.1 | Person A | tests | Add score determinism tests and edge case tests | JDG 01 to JDG 05 | 1h | perfect and broken protocols produce expected relative ordering | ⬜ Not started | — |
555
  | JDG 09 | E05.2 | Person D | UI mocks | Create mock score cards and language for frontend | JDG 05 | 0.5h | UI can display score breakdown from mock data | ⬜ Not started | — |
556
+ | JDG 10 | E05.1 | Person B | notebook support | Expose component metrics for training plots | JDG 05, JDG 07 | 0.5h | notebook can read average rigor, feasibility, fidelity, and bounded tool metrics over time | ⬜ Not started | — |
557
  | JDG 11 | E05.2 | Person A | `replicalab/scoring/rubric.py` and `replicalab/agents/judge_policy.py` | Add structured final audit payload with `judge_notes`, `verdict`, and top failure reasons derived from the rubric | JDG 05, JDG 06 | 0.75h | final judgement output is deterministic, human readable, and consumable by env, API, logs, and UI | ⬜ Not started | — |
558
 
559
  ---
 
578
 
579
  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | ENV 01 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Create `ReplicaLabEnv` class skeleton | MOD 04, SCN 09 | 0.5h | environment class imports and instantiates without runtime errors | ✅ Completed | Person B (Ayush) |
+ | ENV 02 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Implement `reset(seed, template, difficulty)` | ENV 01, SCN 09 | 1h | reset returns initial observations and a fresh episode state | ✅ Completed | Person B (Ayush) |
+ | ENV 03 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Scientist turn application and bounded tool mediation | ENV 02, AGT 05 | 1h | valid Scientist action plus any allowed tool traces update state and history correctly | ✅ Completed | Person B (Ayush) |
+ | ENV 04 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Lab Manager response step with bounded validation tools | ENV 03, AGT 07 | 1h | lab manager response plus any supporting bounded tool traces are appended and returned in the next observation | ✅ Completed | Person B (Ayush) |
+ | ENV 05 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement accept, timeout, and max round logic | ENV 03, ENV 04 | 0.75h | episode terminates correctly on agreement or round limit | ✅ Completed | Person B (Ayush) |
+ | ENV 06 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Integrate reward computation at finalization and optional intermediate score previews | ENV 05, JDG 05 | 1h | final step returns total reward, breakdown info, and deterministic penalties or bonuses for bounded tool behavior | ✅ Completed | Person B (Ayush) |
+ | ENV 07 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `state()` | ENV 02 to ENV 06 | 0.5h | current environment state can be retrieved for debugging and replay | ✅ Completed | Person B (Ayush) |
+ | ENV 08 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `close()` cleanup | ENV 01 | 0.25h | close frees any transient resources and does not throw | ✅ Completed | Person B (Ayush) |
  | ENV 09 | E06.3 | Person C | `replicalab/utils/logging.py` | Write episode logs on completion | ENV 06, JDG 07 | 0.5h | completed episodes generate replayable logs automatically | ⬜ Not started | — |
  | ENV 10 | E06.1 to E06.3 | Person A | tests | Add reset, step, invalid action, timeout, and deterministic replay tests | ENV 02 to ENV 09 | 1.25h | tests pass for seeded reset, valid step, invalid step, and replay consistency | ⬜ Not started | — |
  | ENV 11 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Attach judge audit payload to final `StepResult`, terminal observations, and replay state | ENV 06, JDG 11 | 0.5h | completed episodes expose audit notes alongside reward breakdown in a stable schema | ⬜ Not started | — |
 
  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | API 01 | E07.1 | Person C | `server/app.py` | Create FastAPI app shell and health endpoint | ENV 01 | 0.5h | `GET /health` returns 200 with simple payload | 🟡 Partial | — |
+ | API 02 | E07.1 | Person C | `server/app.py` | Add `POST /reset` endpoint | ENV 02 | 0.75h | reset endpoint starts a new episode and returns initial observation | ✅ Completed | Person B (Ayush) |
+ | API 03 | E07.1 | Person C | `server/app.py` | Add `POST /step` endpoint | ENV 06 | 0.75h | step endpoint accepts valid action and returns step result | ✅ Completed | Person B (Ayush) |
+ | API 04 | E07.1 | Person C | `server/app.py` | Add `GET /scenarios` endpoint | SCN 03 to SCN 05 | 0.5h | endpoint lists available scenario families and difficulties | ✅ Completed | Person B (Ayush) |
  | API 05 | E07.1 | Person C | `server/app.py` | Add `GET /replay/{episode_id}` endpoint | ENV 09 | 0.75h | endpoint returns completed log for valid episode id | ⬜ Not started | — |
+ | API 06 | E07.1 | Person C | `server/app.py` | Add WebSocket session handler | ENV 06 | 1.25h | each connection gets isolated environment state and can reset plus step | ✅ Completed | Person B (Ayush) |
+ | API 07 | E07.1 | Person C | `server/app.py` | Add idle timeout and graceful disconnect cleanup | API 06, ENV 08 | 0.75h | stale connections close cleanly and environment closes without leak | ✅ Completed | Person B (Ayush) |
+ | API 08 | E07.2 | Person C | `server/Dockerfile` | Build Dockerfile with Python app startup on port 7860 | API 01 to API 07 | 0.75h | local Docker run serves app on port 7860 | ✅ Completed | Person B (Ayush) |
+ | API 09 | E07.2 | Person C | HF config files | Add Hugging Face Space metadata and deploy instructions | API 08 | 0.5h | Space config is valid for Docker app deployment | ✅ Completed | Person B (Ayush) |
  | API 10 | E07.2 | Person C | deployment docs | Deploy live Space and verify health, reset, and step | API 09 | 1h | live Space responds successfully to health and one end to end episode | ⬜ Not started | — |
  | API 11 | E07.1 | Person C | tests | Add server endpoint tests and WebSocket smoke test | API 01 to API 07 | 1h | local server tests pass for health, reset, step, invalid payload, and ws connect | ⬜ Not started | — |
  | API 12 | E07.2 | Person D | docs | Capture deployment screenshots and public link for README | API 10 | 0.25h | README ready screenshots and live link are available | ⬜ Not started | — |
+ | API 13 | E07.1 | Person C | `server/app.py` | Add CORS middleware configuration for frontend origins in dev and production | API 01 | 0.25h | frontend on localhost:5173 and HF Space origin can reach the API without CORS errors | ✅ Completed | Person B (Ayush) |
  | API 14 | E07.1 | Person C | `server/app.py` | Add REST session management so each user gets isolated environment state | API 02, API 03 | 0.75h | two concurrent REST users do not share or corrupt each other's episode state | 🟡 Partial | — |
+ | API 15 | E07.2 | Person C | HF Space repo | Create HF Space README.md with YAML frontmatter specifying `sdk: docker`, `app_port: 7860`, title, and emoji | API 08 | 0.25h | HF Space config is valid and Space launches correctly from the metadata | ✅ Completed | Person B (Ayush) |
  | API 16 | E07.2 | Person C | `server/Dockerfile` | Configure Docker to build frontend and serve static assets from FastAPI in a single container | API 08, UI 10 | 0.75h | single Docker container serves both API and frontend on port 7860 | ⬜ Not started | — |
+ | API 17 | E07.2 | Person C | deployment docs | Document secrets and API key management for hosted Scientist model access in deployment and notebook | API 09 | 0.5h | team knows how to set API keys in HF Space secrets, local env, and Colab secrets | ⬜ Not started | — |
+ | API 18 | E07.1 | Person C | `server/app.py` | Include judge audit payload plus bounded tool-trace summaries in REST, replay, and WebSocket responses for terminal episodes | API 03, API 05, API 06, ENV 11 | 0.5h | clients receive `judge_notes`, verdict fields, and bounded tool audit data without separate log file access | ⬜ Not started | — |
  | API 19 | E07.2 | Person C | `openenv.yaml` and deployment docs | Expose and verify OpenEnv built in `/web` fallback route locally and on HF Space | FND 09, API 08, API 10 | 0.5h | `/web` is documented, reachable, and able to run a seeded episode when the custom UI is unavailable | ⬜ Not started | — |

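API 14's requirement, that two concurrent REST users never share episode state, comes down to keying each environment by a session id minted at reset. A handler-level sketch, with the FastAPI route wiring omitted; `SESSIONS`, `FakeEnv`, and the function names are illustrative stand-ins, not the repo's actual API:

```python
import uuid

SESSIONS = {}  # session_id -> env instance, in-memory and per-process

class FakeEnv:
    """Stand-in for ReplicaLabEnv; the real env exposes reset()/step() similarly."""
    def __init__(self):
        self.round = 0
    def reset(self):
        self.round = 0
        return {"round": self.round}
    def step(self, action):
        self.round += 1
        return {"round": self.round, "action": action, "done": False}

def reset_handler():
    # Each reset mints a fresh session id, so concurrent users never share state.
    session_id = str(uuid.uuid4())
    env = FakeEnv()
    SESSIONS[session_id] = env
    return {"session_id": session_id, "observation": env.reset()}

def step_handler(payload):
    env = SESSIONS.get(payload.get("session_id"))
    if env is None:
        raise KeyError("unknown session")  # a real handler would return HTTP 404
    return env.step(payload.get("action"))
```

Wired behind `POST /reset` and `POST /step`, each client only ever touches the env stored under its own id; idle-timeout cleanup (API 07) would evict stale entries from the same dict.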
  ---
 

  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | TRN 01 | E08.1 | Person B | `notebooks/train_colab.ipynb` | Create notebook skeleton with setup, connect, train, bounded-tool policy, and plot sections | API 10 | 0.5h | notebook has clear runnable sections in the right order and documents the bounded-tool policy | ⬜ Not started | — |
  | TRN 02 | E08.1 | Person B | notebook | Add package install and model setup cell for Unsloth or HF TRL | TRN 01 | 0.75h | notebook installs dependencies without manual edits beyond secrets | ⬜ Not started | — |
+ | TRN 03 | E08.1 | Person B | notebook or `client.py` | Implement environment client wrapper for reset plus step over WebSocket or REST | API 06 | 1h | notebook can start and finish an episode against local or hosted env and can read tool-aware step payloads | ⬜ Not started | — |
+ | TRN 04 | E08.1 | Person B | notebook | Implement rollout collection loop for Scientist episodes | TRN 03, AGT 01 | 1h | loop collects trajectories, rewards, done signals, and bounded tool traces from frozen evidence packs | ⬜ Not started | — |
+ | TRN 05 | E08.1 | Person B | notebook | Connect rollouts to GRPO or equivalent trainer | TRN 04 | 1.25h | at least one short training run completes without runtime errors while preserving deterministic reward and frozen evidence inputs | ⬜ Not started | — |
+ | TRN 06 | E08.1 | Person B | notebook | Log episode reward, rigor, feasibility, fidelity, rounds used, and bounded tool metrics | JDG 10, TRN 04 | 0.75h | notebook stores a metrics frame across training episodes including bounded tool metrics | ⬜ Not started | — |
  | TRN 07 | E08.2 | Person B | notebook | Plot reward curve and component curves with matplotlib | TRN 06 | 0.5h | plotted image shows visible metrics and can be saved to file | ⬜ Not started | — |
+ | TRN 08 | E08.2 | Person B | notebook | Add before versus after evaluation on fixed seeds and frozen evidence packs | SCN 11, TRN 05 | 1h | notebook compares baseline and trained policy on the same scenarios and evidence packs | ⬜ Not started | — |
  | TRN 09 | E08.2 | Person B | `replicalab/agents/scientist_policy.py` | Add policy loading path for trained adapter or checkpoint | TRN 05 | 0.5h | evaluation can switch between baseline and trained model cleanly | ⬜ Not started | — |
  | TRN 10 | E08.2 | Person B | docs | Export plot image and sample logs to `outputs/plots` | TRN 07 | 0.25h | plots are saved and versioned for README use | ⬜ Not started | — |
  | TRN 11 | E08.1 | Person C | infra notes | Document environment URL, secrets, and connection troubleshooting | TRN 03 | 0.25h | any team member can run the notebook using the notes | ⬜ Not started | — |
  | TRN 12 | E08.2 | Person D | storytelling | Convert evaluation results into two or three clear bullet insights for judges | TRN 08 | 0.5h | README and demo can state what improved in plain English | ⬜ Not started | — |
+ | TRN 13 | E08.1 | Person B | `replicalab/client.py` | Create reusable environment client module with `connect()`, `reset()`, `step()`, `close()` over REST and WebSocket | API 06 | 1h | client module can be imported by notebook and other consumers without duplicating connection logic | ✅ Completed | 2026-03-08 |
  | TRN 14 | E08.1 | Person B | notebook or docs | Select and document base model for Scientist fine tuning with rationale for size, license, and structured output capability | TRN 01 | 0.5h | base model choice is documented and all team members know which model is being trained | ⬜ Not started | — |
+ | TRN 15 | E08.2 | Person B | notebook | Add agreement rate, invalid action rate, and invalid bounded-tool rate aggregation to evaluation outputs and before versus after comparison | TRN 06, TRN 08, OBS 09 | 0.5h | notebook reports reward, rounds, agreement rate, invalid action rate, and invalid bounded-tool rate for baseline and trained runs | ⬜ Not started | — |
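The TRN 13 client named above exposes `connect()`, `reset()`, `step()`, and `close()`. A REST-only sketch using just the standard library; the endpoint payload shapes and the `session_id` field are assumptions about the server contract, not taken from `replicalab/client.py` itself:

```python
import json
import urllib.request

class ReplicaLabClient:
    """Hypothetical REST client sketch; the real module also covers WebSocket."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")
        self.session_id = None

    def _post(self, path, payload):
        # Encode the payload as JSON and return the decoded JSON response.
        req = urllib.request.Request(
            self.base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

    def reset(self, seed=None, template=None, difficulty=None):
        out = self._post("/reset", {"seed": seed, "template": template,
                                    "difficulty": difficulty})
        self.session_id = out.get("session_id")  # remembered for later steps
        return out

    def step(self, action):
        return self._post("/step", {"session_id": self.session_id, "action": action})

    def close(self):
        self.session_id = None
```

The notebook rollout loop would then call `reset()` once per episode and `step()` until the returned payload reports `done`, reading reward and component scores from the final info.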
 
  ---

  | OBS 03 | E10.1 | Person C | replay utilities | Add episode id generation and file naming conventions | OBS 01 | 0.25h | logs never overwrite and are easy to locate | ⬜ Not started | — |
  | OBS 04 | E10.2 | Person A | tests | Add deterministic replay test using seed and action sequence | ENV 10 | 0.75h | replay of same seed and actions matches prior state sequence | ⬜ Not started | — |
  | OBS 05 | E10.2 | Person D | UI | Surface episode id and replay link in UI | API 05, UI 08 | 0.5h | user can easily capture or revisit a past episode | ⬜ Not started | — |
+ | OBS 06 | E10.1 | Person B | notebook | Log training run metadata including model, seed, scenario set, steps, evidence-pack version, and bounded-tool policy | TRN 06 | 0.5h | notebook exports metadata with each run for reproducibility including evidence-pack version and bounded-tool policy | ⬜ Not started | — |
  | OBS 07 | E10.1 | Person C | scripts | Add simple local script to run one episode and dump logs | ENV 06, OBS 01 | 0.5h | one command produces a complete local sample log | ⬜ Not started | — |
  | OBS 08 | E10.2 | Person D | storytelling | Create static replay screenshots or gifs for README and video | UI 08 | 0.5h | at least two crisp visual assets are ready for docs and demo | ⬜ Not started | — |
  | OBS 09 | E10.1 | Person C | `replicalab/utils/logging.py` | Extend episode summary schema with `judge_notes`, `agreement`, `invalid_action_count`, and `invalid_action_rate` for replay and evaluation consumers | OBS 01, JDG 11, ENV 11 | 0.5h | every completed episode log contains the audit payload plus demo and evaluation metrics needed by notebook, UI, and README | ⬜ Not started | — |
 

  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | TST 01 | E11.1 | Person A | `tests/test_env.py` | Add reset returns valid observations test | ENV 02 | 0.5h | test confirms both roles receive valid structured observations | ✅ Completed | Person B (Ayush) |
+ | TST 02 | E11.1 | Person A | `tests/test_env.py` | Add valid action step test | ENV 03 to ENV 06 | 0.5h | valid action advances round and returns correct shape | ✅ Completed | Person B (Ayush) |
+ | TST 03 | E11.1 | Person A | `tests/test_env.py` | Add invalid action handling test | MOD 05, ENV 03 | 0.5h | invalid action yields structured error and environment survives | ✅ Completed | Person B (Ayush) |
+ | TST 04 | E11.1 | Person A | `tests/test_reward.py` | Add perfect protocol high reward test | JDG 04 | 0.5h | perfect protocol scores higher than baseline and broken protocol | ✅ Completed | Person B (Ayush) |
+ | TST 05 | E11.1 | Person A | `tests/test_reward.py` | Add zero dimension or penalty behavior test | JDG 04 | 0.5h | zero feasibility or timeout lowers reward as expected | ✅ Completed | Person B (Ayush) |
  | TST 06 | E11.1 | Person C | `tests/test_server.py` | Add health plus reset plus step endpoint tests | API 01 to API 03 | 0.75h | API tests pass locally | ⬜ Not started | — |
+ | TST 07 | E11.1 | Person C | `tests/test_server.py` | Add WebSocket connection and invalid payload tests | API 06 | 0.75h | WebSocket errors are graceful and session stays isolated | ✅ Completed | Person B (Ayush) |
  | TST 08 | E11.2 | Person D | manual checklist | Create demo smoke checklist for local and hosted builds | UI 12, API 10 | 0.5h | team can verify full demo in under five minutes | ⬜ Not started | — |
+ | TST 09 | E11.2 | Person B | notebook checklist | Create notebook smoke test for fresh runtime | TRN 12 | 0.5h | training notebook runs from top with minimal edits and the bounded-tool path works against frozen evidence packs | ⬜ Not started | — |
  | TST 10 | E11.2 | all | full run | Execute one integrated test pass before freeze | all prior TST tasks | 1h | environment, UI, Space, and notebook all pass their smoke tests the same day | ⬜ Not started | — |
  | TST 11 | E11.1 | Person C | `tests/test_server.py` and `tests/test_env.py` | Add contract tests for judge audit payloads and invalid action metrics in terminal responses and replay logs | API 18, OBS 09 | 0.75h | tests confirm terminal payloads and replay files expose audit notes, agreement, and invalid action metrics | ⬜ Not started | — |
  | TST 12 | E11.2 | Person D | manual checklist | Add fallback `/web` smoke step plus replay slider and before versus after toggle checks to demo checklist | API 19, UI 14, UI 15 | 0.5h | checklist verifies custom UI path and fallback UI path are both demo ready | ⬜ Not started | — |
 
939
  3. reward
940
  4. done
941
  5. final info including component scores
942
+ 6. API key or secret configuration for hosted-model access in both hosted and notebook environments
943
 
944
  ### Scenario to judge contract
945
  Every scenario must provide:
docs/ayush/task_breakdown.md CHANGED
@@ -9,39 +9,52 @@ No assumptions from other documents are used to reclassify blocked status.

## 1. Blocking Status

- `FND 08`, `FND 09`, `MOD 09`, `SCN 11`, and `AGT 01` are now complete.
+ `FND 08`, `FND 09`, `MOD 09`, `SCN 11`, `AGT 01`, `AGT 02`, `AGT 03`,
+ `AGT 04`, `AGT 05`, `AGT 06`, `AGT 07`, `AGT 08`, `AGT 11`, and `TRN 13`
+ are now complete.
The scenario prerequisite bundle (`SCN 01` to `SCN 10`) also exists in the
repo, so Ayush no longer waits on `SCN 09` to start prompt-layer work.

Ayush now has one fully unblocked task:

- 1. `AGT 03` -- highest leverage next task inside the Scientist chain
+ 1. `TRN 03` -- environment client wrapper for notebook rollouts (uses `replicalab/client.py` from TRN 13)

The prompt and Lab Manager workstream continues to assume a normalized scenario
pack below the stable outer contract, so Ayush-owned prompting should be
assembled from mapped scenario data rather than hard-coded to one domain.

+ Bounded-tool scope note:
+
+ 1. Ayush-owned prompt, training, and client tasks now assume bounded `search`,
+    `code_check`, and `image_inspection` capabilities.
+ 2. Training must still use frozen evidence packs and deterministic reward.
+ 3. Live web search is for validation or demo-time evidence only, not the core
+    reward loop.
+ 4. Audio remains out of scope.
+
---

## 2. Active Now

| ID | Task | Depends On | Why It Is Ready | Est |
|----|------|-----------|-----------------|-----|
- | AGT 03 | Parse plus retry for malformed output | MOD 09, AGT 02 | The parser and observation formatter are now both complete | 0.75h |
+ | TRN 03 | Env client wrapper in notebook | API 06, TRN 13 | `replicalab/client.py` is complete with dual-transport support; TRN 03 wraps it for notebook rollout use | 1h |

- **Total: 1 task, 0.75h**
+ **Total: 1 task, 1h**

---

- ## 3. Internal Ayush Chain After AGT 03
+ ## 3. Internal Ayush Chain After API 06

These are blocked only by earlier Ayush-owned work.

| ID | Task | Depends On | Blocked By | Est |
|----|------|-----------|-----------|-----|
- | AGT 08 | Prompt formatting and parse tests | AGT 01 to AGT 04 | Person B: AGT 03 | 0.75h |
+ | TRN 04 | Rollout collection loop with frozen evidence packs and bounded tool traces | TRN 03, AGT 01 | Person B: TRN 03 | 1h |
+ | TRN 05 | Connect rollouts to GRPO trainer | TRN 04 | Person B: TRN 04 | 1.25h |
+ | TRN 09 | Policy loading for trained checkpoint | TRN 05 | Person B: TRN 05 | 0.5h |

- **Total: 1 task, 0.75h**
+ **Total: 3 tasks, 2.75h**

---

@@ -49,46 +62,33 @@ These are blocked only by earlier Ayush-owned work.

| ID | Task | Depends On | Remaining External Deliverable | Est |
|----|------|-----------|-------------------------------|-----|
- | JDG 10 | Expose component metrics for training plots | JDG 05, JDG 07 | `JDG 05` from Kian and `JDG 07` from Max | 0.5h |
+ | AGT 10 | Write domain-neutral prompt text files for all 3 roles with bounded tool rules | AGT 01, AGT 07, JDG 06 | `JDG 06` from Kian | 0.75h |
+ | JDG 10 | Expose component metrics for training plots | JDG 05, JDG 07 | `JDG 07` from Max | 0.5h |

- **Total: 1 task, 0.5h**
+ **Total: 2 tasks, 1.25h**

### What to ask Kian for first

- 1. `JDG 05` and `JDG 06` -- unlock `JDG 10` and later `AGT 10`
+ 1. `JDG 06` -- unlocks `AGT 10`
2. `SCN 13` -- deepens the booking-conflict layer for the Lab Manager path
- 3. `ENV 01` -- makes the real environment path available beyond the stub server
-
- ---
-
- ## 5. Mixed Chain After AGT 05 and Judge Work
-
- These depend on both Ayush-owned work and remaining upstream work.
-
- | ID | Task | Depends On | Blocked By | Est |
- |----|------|-----------|-----------|-----|
- | AGT 10 | Write domain-neutral prompt text files for all 3 roles | AGT 01, AGT 07, JDG 06 | Person A: JDG 06 | 0.75h |
-
- **Total: 1 task, 0.75h**
+ 3. `ENV 10` and `JDG 08` -- strengthen the env or judge regression layer before training ramps

---

- ## 6. Blocked by Max (Person C)
+ ## 5. Blocked by Max (Person C)

- Cannot proceed until Max delivers the server and deployment pieces.
+ Cannot proceed until Max delivers the remaining server and deployment pieces.

| ID | Task | Depends On | Max Deliverable | Est |
|----|------|-----------|----------------|-----|
- | TRN 01 | Notebook skeleton | API 10 | Deployed HF Space | 0.5h |
- | TRN 03 | Env client wrapper in notebook | API 06 | WebSocket handler against the real env | 1h |
- | TRN 13 | `client.py` reusable module | API 06 | WebSocket handler against the real env | 1h |
+ | TRN 01 | Notebook skeleton | API 10 | Deployed HF Space or stable hosted env URL | 0.5h |

- **Total: 3 tasks, 2.5h**
+ **Total: 1 task, 0.5h**

### What to ask Max for first

- 1. `API 06` -- unblocks `TRN 03` and `TRN 13`
- 2. `API 10` -- unblocks `TRN 01`
+ 1. `API 10` -- unlocks `TRN 01`
+ 2. `JDG 07` -- unlocks `JDG 10`

---

@@ -99,19 +99,20 @@ are done.

| Order | ID | Task | Depends On | Est |
|-------|----|------|-----------|-----|
- | 1 | TRN 02 | Package install and model setup cell | TRN 01 | 0.75h |
- | 2 | TRN 14 | Select and document base model (notebook side) | TRN 01 | 0.5h |
- | 3 | TRN 04 | Rollout collection loop | TRN 03, AGT 01 | 1h |
- | 4 | TRN 05 | Connect rollouts to GRPO trainer | TRN 04 | 1.25h |
- | 5 | TRN 06 | Log episode metrics | JDG 10, TRN 04 | 0.75h |
- | 6 | TRN 07 | Plot reward curves | TRN 06 | 0.5h |
- | 7 | TRN 08 | Before vs after eval on fixed seeds | SCN 11, TRN 05 | 1h |
- | 8 | TRN 09 | Policy loading for trained checkpoint | TRN 05 | 0.5h |
- | 9 | TRN 10 | Export plots to outputs/plots | TRN 07 | 0.25h |
- | 10 | TRN 15 | Agreement and invalid action rate aggregation | TRN 06, TRN 08, OBS 09 | 0.5h |
- | 11 | OBS 06 | Log training run metadata | TRN 06 | 0.5h |
-
- **Total: 11 tasks, 7.5h**
+ | 1 | TRN 01 | Notebook skeleton | API 10 | 0.5h |
+ | 2 | TRN 02 | Package install and model setup cell | TRN 01 | 0.75h |
+ | 3 | TRN 14 | Select and document base model (notebook side) | TRN 01 | 0.5h |
+ | 4 | TRN 04 | Rollout collection loop with frozen evidence packs and bounded tool traces | TRN 03, AGT 01 | 1h |
+ | 5 | TRN 05 | Connect rollouts to GRPO trainer | TRN 04 | 1.25h |
+ | 6 | TRN 06 | Log episode metrics plus bounded tool metrics | JDG 10, TRN 04 | 0.75h |
+ | 7 | TRN 07 | Plot reward curves | TRN 06 | 0.5h |
+ | 8 | TRN 08 | Before vs after eval on fixed seeds and frozen evidence packs | SCN 11, TRN 05 | 1h |
+ | 9 | TRN 09 | Policy loading for trained checkpoint | TRN 05 | 0.5h |
+ | 10 | TRN 10 | Export plots to outputs/plots | TRN 07 | 0.25h |
+ | 11 | TRN 15 | Agreement, invalid action, and invalid bounded-tool rate aggregation | TRN 06, TRN 08, OBS 09 | 0.5h |
+ | 12 | OBS 06 | Log training run metadata | TRN 06 | 0.5h |
+
+ **Total: 12 tasks, 8h**

---

@@ -134,43 +135,46 @@ are done.
3. `MOD 09`
4. `SCN 11`
5. `AGT 01`
+ 6. `AGT 02`
+ 7. `AGT 03`
+ 8. `AGT 04`
+ 9. `AGT 05`
+ 10. `AGT 06`
+ 11. `AGT 07`
+ 12. `AGT 08`
+ 13. `AGT 11`
+ 14. `TRN 13`

### Phase 2: Active now

- 6. `AGT 03`
+ 15. `TRN 03`

- ### Phase 3: After AGT 03
+ ### Phase 3: After `API 10`

- 7. `AGT 08`
+ 16. `TRN 01`
+ 17. `TRN 02`
+ 18. `TRN 14`

### Phase 4: After judge work

- 8. `AGT 10`
- 9. `JDG 10`
-
- ### Phase 5: After Max lands `API 06` and `API 10`
-
- 10. `TRN 13`
- 11. `TRN 01`
- 12. `TRN 02`
- 13. `TRN 03`
- 14. `TRN 14`
+ 19. `AGT 10`
+ 20. `JDG 10`

- ### Phase 6: Training pipeline
+ ### Phase 5: Training pipeline

- 15. `TRN 04`
- 16. `TRN 05`
- 17. `TRN 06`
- 18. `TRN 07`
- 19. `TRN 08`
- 20. `TRN 09`
- 21. `TRN 10`
- 22. `TRN 15`
- 23. `OBS 06`
+ 21. `TRN 04`
+ 22. `TRN 05`
+ 23. `TRN 06`
+ 24. `TRN 07`
+ 25. `TRN 08`
+ 26. `TRN 09`
+ 27. `TRN 10`
+ 28. `TRN 15`
+ 29. `OBS 06`

### Phase 7: Final notebook validation

- 24. `TST 09`
+ 30. `TST 09`

---

@@ -178,14 +182,13 @@

| Category | Count | Hours |
|----------|-------|-------|
- | Active now | 1 | 0.75h |
- | Internal Ayush chain after AGT 03 | 1 | 0.75h |
- | Blocked by Kian or mixed A+B work | 1 | 0.5h |
- | Mixed chain after AGT 05 and judge work | 1 | 0.75h |
- | Blocked by Max | 3 | 2.5h |
- | Deep training chain | 11 | 7.5h |
+ | Active now | 1 | 1h |
+ | Internal Ayush chain after API 06 | 3 | 2.75h |
+ | Blocked by Kian or mixed A+B work | 2 | 1.25h |
+ | Blocked by Max | 1 | 0.5h |
+ | Remaining downstream training chain | 8 | 4.75h |
| Blocked by Kush | 1 | 0.5h |
- | **Total remaining** | **19** | **13.25h** |
+ | **Total remaining** | **16** | **10.75h** |

---
docs/ayush/task_list.md CHANGED
@@ -11,14 +11,17 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
- `SCN 11` is complete in `tests/fixtures/golden_scenarios.json`
- `AGT 01` is complete in `replicalab/agents/scientist_policy.py`
- `AGT 02` is complete in `replicalab/agents/scientist_policy.py`
+ - `AGT 03` is complete in `replicalab/agents/scientist_policy.py`
- `AGT 04` is complete in `replicalab/agents/scientist_policy.py`
- `AGT 05` is complete in `replicalab/agents/lab_manager_policy.py`
- `AGT 06` is complete in `replicalab/agents/lab_manager_policy.py`
- `AGT 07` is complete in `replicalab/agents/lab_manager_policy.py`
+ - `AGT 08` is complete in `replicalab/agents/scientist_policy.py`
- `AGT 11` is complete in `docs/agt11_scientist_model_selection.md`
- The scenario prerequisite bundle (`SCN 01` to `SCN 10`) is now present in the repo, so Ayush prompt work is backed by real normalized scenario packs instead of placeholders
- - The next fully unblocked Ayush task is `AGT 03`
- - `AGT 03` is now the highest-leverage next step because the formatter and parser are both in place, so the retry loop can complete the Scientist action path end-to-end
+ - `API 06` is now complete, so `TRN 03` and `TRN 13` were fully unblocked
+ - `TRN 13` is now complete in `replicalab/client.py`
+ - The next fully unblocked Ayush task is `TRN 03`
- `AGT 10` now waits only on `JDG 06`

---

@@ -43,9 +46,9 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
- [x] **AGT 05** | Implement deterministic feasibility checker over normalized constraints and resources (shared with Person A) | 1.25h | Depends: SCN 07, MOD 05 | Status: completed on 2026-03-08
- [x] **AGT 06** | Implement alternative suggestion logic from allowed substitutions and tradeoffs | 1h | Depends: AGT 05, SCN 08 | Status: completed on 2026-03-08
- [x] **AGT 07** | Add model-backed Lab Manager response synthesis from checker output | 0.75h | Depends: AGT 05 | Status: completed on 2026-03-08
+ - [x] **AGT 03** | Add parse plus retry strategy for malformed model output | 0.75h | Depends: MOD 09, AGT 02 | Status: completed on 2026-03-07
+ - [x] **AGT 08** | Add prompt formatting and parse tests | 0.75h | Depends: AGT 01 to AGT 04 | Status: completed on 2026-03-07
- [x] **AGT 11** | Select and document base model for Scientist training | 0.5h | Depends: AGT 01 | Status: completed on 2026-03-08
- - [ ] **AGT 03** | Add parse plus retry strategy for malformed model output | 0.75h | Depends: MOD 09, AGT 02 | Status: ready now
- - [ ] **AGT 08** | Add prompt formatting and parse tests | 0.75h | Depends: AGT 01 to AGT 04
- [ ] **AGT 10** | Write domain-neutral prompt text files for all three roles | 0.75h | Depends: AGT 01, AGT 07, JDG 06

---

@@ -68,7 +71,7 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
- [ ] **TRN 08** | Add before versus after evaluation on fixed seeds | 1h | Depends: SCN 11, TRN 05
- [ ] **TRN 09** | Add policy loading path for trained adapter | 0.5h | Depends: TRN 05
- [ ] **TRN 10** | Export plot image and sample logs to outputs/plots | 0.25h | Depends: TRN 07
- - [ ] **TRN 13** | Create reusable environment client module (client.py) | 1h | Depends: API 06
+ - [x] **TRN 13** | Create reusable environment client module (client.py) | 1h | Depends: API 06 | Status: completed on 2026-03-08
- [ ] **TRN 14** | Select and document base model (notebook side) | 0.5h | Depends: TRN 01 | Assumption: Qwen3-4B primary, Qwen3-8B H100-only stretch
- [ ] **TRN 15** | Add agreement rate and invalid action rate aggregation | 0.5h | Depends: TRN 06, TRN 08, OBS 09

@@ -97,6 +100,6 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
| Metric | Value |
|--------|-------|
| Total tasks | 29 |
- | Completed | 10 |
- | Remaining | 19 |
- | Total estimated hours | 21.5h |
+ | Completed | 13 |
+ | Remaining | 16 |
+ | Total estimated hours | 11.75h |
docs/changes.md CHANGED
@@ -30,4 +30,20 @@ Rules:
30
  | 2026-03-08 | Person B (Ayush) | SCN 01 to SCN 10 | Executed the full scenario-engine prerequisite bundle even though it was assigned to Person A and originally sequenced after `MOD 04` | `SCN 11` and `AGT 01` needed a real normalized scenario generator rather than another placeholder, and the Kian plus Ayush lanes are being covered together | The repo now has deterministic seeded scenario generation for mathematics, machine learning, and finance-trading planning, plus golden fixtures and seeded scenario tests; `SCN 11`, `AGT 01`, and the stub server scenario list are now backed by the same normalized scenario pack | `MOD 04` still needs to thread the normalized scenario pack through `EpisodeState` and replay models cleanly |
31
  | 2026-03-08 | Person B (Ayush) | Architecture roadmap | Shifted the planning docs from lab-first replication toward a normalized multi-domain scenario layer with mathematics and machine learning first, finance and trading planning third, and physics or biology later | The team wants the environment to stay domain-agnostic under a stable outer contract while keeping the reward deterministic and making the Lab Manager stronger for the hackathon story | The source-of-truth backlog, README, and Kian or Ayush planning docs now assume `scenario adapter -> normalized scenario pack -> observation mapper -> stable contracts`, plus a hybrid Lab Manager with deterministic feasibility grounding | `SCN 02`, `SCN 07`, `SCN 08`, `AGT 01`, `AGT 05`, `AGT 07`, and the judge wording must now be implemented to this architecture |
32
  | 2026-03-08 | Person B (Ayush) | FND 03 and FND 12 | Imported the frontend shell and Vite proxy config from Kush's branch even though both tasks are assigned to Max | The `ayush` integration branch only had the frontend scaffold, and the validated frontend from `origin/Kush` needed to exist on the integration branch for future UI and deployment work | `frontend/` now contains the full React plus Vite app, `frontend/vite.config.ts` is present with API and WebSocket proxy rules, and local validation passed with `npm --prefix frontend install` plus `npm --prefix frontend run build` | `FND 13` and `UI 01` are now unblocked; remaining UI tasks still need explicit review before being marked complete |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
33
 
 
30
  | 2026-03-08 | Person B (Ayush) | SCN 01 to SCN 10 | Executed the full scenario-engine prerequisite bundle even though it was assigned to Person A and originally sequenced after `MOD 04` | `SCN 11` and `AGT 01` needed a real normalized scenario generator rather than another placeholder, and the Kian plus Ayush lanes are being covered together | The repo now has deterministic seeded scenario generation for mathematics, machine learning, and finance-trading planning, plus golden fixtures and seeded scenario tests; `SCN 11`, `AGT 01`, and the stub server scenario list are now backed by the same normalized scenario pack | `MOD 04` still needs to thread the normalized scenario pack through `EpisodeState` and replay models cleanly |
31
  | 2026-03-08 | Person B (Ayush) | Architecture roadmap | Shifted the planning docs from lab-first replication toward a normalized multi-domain scenario layer with mathematics and machine learning first, finance and trading planning third, and physics or biology later | The team wants the environment to stay domain-agnostic under a stable outer contract while keeping the reward deterministic and making the Lab Manager stronger for the hackathon story | The source-of-truth backlog, README, and Kian or Ayush planning docs now assume `scenario adapter -> normalized scenario pack -> observation mapper -> stable contracts`, plus a hybrid Lab Manager with deterministic feasibility grounding | `SCN 02`, `SCN 07`, `SCN 08`, `AGT 01`, `AGT 05`, `AGT 07`, and the judge wording must now be implemented to this architecture |
32
  | 2026-03-08 | Person B (Ayush) | FND 03 and FND 12 | Imported the frontend shell and Vite proxy config from Kush's branch even though both tasks are assigned to Max | The `ayush` integration branch only had the frontend scaffold, and the validated frontend from `origin/Kush` needed to exist on the integration branch for future UI and deployment work | `frontend/` now contains the full React plus Vite app, `frontend/vite.config.ts` is present with API and WebSocket proxy rules, and local validation passed with `npm --prefix frontend install` plus `npm --prefix frontend run build` | `FND 13` and `UI 01` are now unblocked; remaining UI tasks still need explicit review before being marked complete |
33
+ | 2026-03-08 | Person B (Ayush) | Capability scope and backlog | Expanded the MVP from pure constrained negotiation to bounded evidence-backed research planning with scoped search, code-check, and image-inspection capability, while explicitly excluding audio and unrestricted live web in training | The team decided that research applicability requires richer capabilities, but the hackathon still needs a deterministic RL story with bounded tools and reproducible rewards | The source-of-truth backlog now treats richer capabilities as an additive layer below the frozen outer contract; completed schema and agent work stays valid, while pending prompt, judge, environment, API, and training tasks now absorb bounded tool and evidence-pack support | Keep live web mostly for demo or eval validation, and keep frozen evidence packs as the default training path |
34
+ | 2026-03-07 | Person B (Ayush) | AGT 03 | Backlog showed "Not started" but the implementation (parse-and-retry loop with telemetry) already existed from a prior commit | The code and 7 tests were committed earlier but the tracker was never updated | Synced both `ReplicaLab_Comprehensive_Task_Division.md` and `docs/completion.md` to reflect completed status | None |
35
+ | 2026-03-07 | Person B (Ayush) | AGT 08 | Expanded scope from test-only to tests plus a bounded-tool policy prompt patch in `build_scientist_system_prompt()` | The acceptance criteria required testing bounded-tool policy reminders, but no tool-policy text existed in the prompt yet; user directed adding the prompt text alongside the tests | Added policy block for `search_evidence`, `run_code_check`, and `inspect_image` to the system prompt; wrote 24 new tests covering parser, prompt, formatter, baseline, and bounded-tool policy; all 111 tests pass | None |
36
+ | 2026-03-08 | Person B (Ayush) | ENV 01 | Executed the task even though it was assigned to Person A | The real environment class was still missing, but the server now switches to `ReplicaLabEnv` on successful import, so a working drop-in module was needed before environment and API work could safely proceed | Added `replicalab/env/replicalab_env.py` and `replicalab/env/__init__.py` as a working drop-in replacement for the former in-server stub, verified direct `reset() -> step() -> state() -> close()` behavior, and confirmed the full test suite stays green at `111 passed` | `ENV 02` and `ENV 08` are now unblocked, and the server can instantiate the real env class instead of the fallback stub |
37
+ | 2026-03-08 | Person B (Ayush) | JDG 01, JDG 02, JDG 03 | Executed three scoring tasks assigned to Person A | The judge scoring chain was the next critical-path blocker: JDG 04 (total reward formula) depends on all three, and ENV 06 (reward integration) depends on JDG 05 which depends on JDG 04 | Added `replicalab/scoring/rigor.py` (weighted structural completeness, success criteria coverage, required element coverage), `replicalab/scoring/feasibility.py` (7-dimension partial-credit scorer wrapping AGT 05 feasibility checker), `replicalab/scoring/fidelity.py` (substitution-aware hidden-reference adherence scorer), shared `replicalab/utils/text.py` (token extraction and label normalization), `replicalab/scoring/__init__.py` (exports), and `tests/test_reward.py` (18 tests covering ordering, determinism, partial credit, domain range, and cross-scorer consistency); all 134 tests pass | JDG 04 is now unblocked; tracker docs were synced separately |
38
+ | 2026-03-08 | Person B (Ayush) | ENV 02, ENV 03, ENV 04, ENV 05, ENV 06, ENV 07, ENV 08, JDG 04, JDG 05, TST 01, TST 02, TST 03 | Executed the full environment chain and rubric tasks assigned to Person A | The environment needed real scenario wiring, validation, grounded Lab Manager responses, centralized termination, judge-computed rewards, deep state snapshots, and close lifecycle guards; the rubric needed the total reward formula and breakdown builder; and the test suite needed reset, step, and invalid-action coverage | Rewrote `replicalab/env/replicalab_env.py` (ENV 02-08: scenario-pack-backed observations, protocol validation, grounded LM pipeline, accept-or-max-rounds termination, real judge scoring via rubric, deep state copies, closed-env guard), created `replicalab/scoring/rubric.py` (JDG 04-05: `compute_total_reward` with `10 × r × f × fi + bonuses − penalties`, `build_reward_breakdown` composing all three sub-scores with efficiency bonus), updated `replicalab/scoring/__init__.py` exports, and created `tests/test_env.py` (TST 01-03: 32 tests covering reset, step, invalid action, state snapshot, close/reopen, and rubric); all 166 tests pass | JDG 06, JDG 08, ENV 10, ENV 11, TST 04, TST 05 are now unblocked; partial server tasks (API 02, 03, 06, 07) can now wire against the real env |
39
+ | 2026-03-07 | Person B (Ayush) | JDG 04, JDG 05, ENV 06 finalization | Refined the draft implementations to match final acceptance criteria | JDG 04 needed a zero-clamp floor and JDG 05 needed a named-penalty extension point for bounded-tool diagnostics; ENV 06 needed to distinguish timeout from no-agreement verdicts | `compute_total_reward` now clamps at 0.0; `build_reward_breakdown` accepts optional `penalties: dict[str, float]` for named penalty keys like `invalid_tool_use` and `unsupported_claim`; terminal-without-agreement path now returns `timeout` when max rounds reached vs `no_agreement` otherwise; added 8 new tests in `test_reward.py` and 4 new tests in `test_env.py`; 178 tests pass across the full suite | None |
40
+ | 2026-03-07 | Person B (Ayush) | API 03 | Completed the `POST /step` endpoint task assigned to Person C by fixing stale replay logging and adding endpoint tests | The `_build_episode_log()` helper still hardcoded stub audit notes, rebuilt `RewardBreakdown` from state, and used `accept`/`revise` instead of the real `timeout`/`no_agreement` verdicts; both REST and WebSocket terminal paths used the stale helper; and no `/step` endpoint tests existed | Updated `_build_episode_log()` to accept the terminal `StepResult` and use its real `reward_breakdown`, `judge_notes`, and `verdict`; updated both REST `/step` and WebSocket step completion paths to pass the result; fixed `_StubEnv` reference to removed helper; added five endpoint tests covering happy path, invalid session 404, terminal real reward breakdown, semantic invalid action as 200 with `info.error`, and replay with real judge data; all 183 tests pass | API 14 and API 18 are now closer to completion; TST 06 is partially covered by the new tests |
41
+ | 2026-03-07 | Person B (Ayush) | API 06 and TST 07 | Executed the WebSocket session handler task and its test task even though both were assigned to Person C | The WebSocket handler already existed in `server/app.py` but had no test coverage, and completing `API 06` was needed to unblock `TRN 03` and `TRN 13` in Person B's own lane | Added 12 WebSocket tests covering connectivity, message handling, error paths, session isolation, semantic-vs-transport error distinction, timeout verdict with real-env integration, and terminal episode replay persistence via `GET /replay/{episode_id}`; all 195 tests pass; `TRN 03` and `TRN 13` are now unblocked for Person B | `TRN 03` and `TRN 13` are now the next Person B tasks |
42
+ | 2026-03-08 | Person B (Ayush) | API 13 | Executed the task even though it was assigned to Person C | The CORS middleware already existed in `server/app.py`, but the task was still partial because frontend-origin verification had not been made explicit | Added three server tests covering localhost Vite preflight, Hugging Face Space origin preflight, and disallowed-origin rejection; `API 13` is now recorded complete in the source of truth and owner trackers | `API 02`, `API 04`, `API 07`, `API 08`, `API 14`, and `OBS 02` remain in Max's active lane |
43
+ | 2026-03-08 | Person B (Ayush) | API 04 | Executed the task even though it was assigned to Person C | The `/scenarios` endpoint and its focused tests already met the acceptance criteria, but the task was still marked partial in the trackers | Recorded `API 04` complete in the source of truth and owner trackers based on the existing typed response model, normalized family list, and five dedicated endpoint tests | `API 07`, `API 08`, `API 14`, and `OBS 02` remain in Max's active lane |
44
+ | 2026-03-08 | Person B (Ayush) | API 02 | Completed the `POST /reset` endpoint verification and test closure even though the task was assigned to Person C | The endpoint already worked against the real env via `_make_env()` but had no dedicated test coverage and was still marked partial in the tracker | Added seven dedicated `/reset` endpoint tests covering response shape, both-role observation, explicit session_id reuse with prior-env close, default params, all scenario and difficulty combos, and seed determinism; all 202 tests pass; `API 14` and `UI 06` are now closer to completion | None |
+ | 2026-03-08 | Person B (Ayush) | TRN 13 | Implemented `replicalab/client.py` as specified in the task backlog | `API 06` was complete and `TRN 13` was the next unblocked Person B task | Created `ReplicaLabClient` with dual-transport support (REST via `httpx`, WebSocket via `websocket-client`), unified sync interface (`connect`, `reset`, `step`, `state`, `close`), context manager, internal session tracking, typed Pydantic returns, and 24 tests covering both transports; all 231 tests pass | `TRN 03` is now the next unblocked Person B task |
+ | 2026-03-08 | Person B (Ayush) | API 07 | Completed the WebSocket idle-timeout and graceful-disconnect verification even though the task was assigned to Person C | The idle-timeout logic and `finally: env.close()` path already existed in `server/app.py`, but the task was still partial because resource-cleanup verification had not been made explicit | Added two focused WebSocket tests covering idle timeout close code `1000` and exactly-once `env.close()` on disconnect; `API 07` is now recorded complete in the source of truth and owner trackers | `API 08`, `API 14`, and `OBS 02` remain in Max's active lane |
+ | 2026-03-08 | Person B (Ayush) | API 08 | Completed the Docker build and run verification even though the task was assigned to Person C | The Dockerfile existed but had never been verified end to end; editable install failed inside Docker, and `httpx` plus `websocket-client` were missing from `server/requirements.txt` | Fixed `pip install -e .` to `pip install .` in both `server/Dockerfile` and root `Dockerfile`; added `httpx` and `websocket-client` to `server/requirements.txt`; rebuilt without cache; verified container starts with `"env":"real"` and all four endpoints (`/health`, `/scenarios`, `/reset`, `/step`) respond correctly; added verified endpoint commands to `docs/max/deployment.md` | `API 09` and `API 16` are now unblocked |
+ | 2026-03-08 | Person B (Ayush) | Recovery sync, API 09, API 15, TST 04, TST 05 | Recovered the lost env, server, client, and test bundle from unreachable git objects and re-synced the deployment and testing trackers to the validated repo state | The branch had rolled back to `5538ba0`, which left the working code, deployment metadata, and tracker files out of sync even though the recovered code passes 231 tests, Docker validation, and OpenEnv validation | Restored the missing runtime files, revalidated the real env and Docker path, recorded the HF Space metadata tasks (`API 09`, `API 15`) as complete, and closed the two reward-regression tests (`TST 04`, `TST 05`) that are already covered in `tests/test_reward.py` | Live HF Space bring-up remains `API 10` |
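The idle-timeout and exactly-once cleanup behavior verified in the `API 07` entries above follows a common asyncio pattern; a minimal sketch (the names, timeout value, and helper bodies here are illustrative, not the server's actual code):

```python
import asyncio

IDLE_TIMEOUT_S = 0.05  # illustrative; the real server uses a longer window


async def run_session(receive, close_env):
    """Close the env exactly once, whether the client idles out or disconnects."""
    try:
        while True:
            try:
                msg = await asyncio.wait_for(receive(), timeout=IDLE_TIMEOUT_S)
            except asyncio.TimeoutError:
                # The server would close the WebSocket with code 1000 here.
                return "idle_timeout"
            if msg is None:  # client disconnected
                return "disconnect"
    finally:
        close_env()  # runs on every exit path, exactly once


async def _silent_client():
    # A client that never sends anything, to trip the idle timeout.
    await asyncio.sleep(3600)


close_calls = []
verdict = asyncio.run(run_session(_silent_client, lambda: close_calls.append(1)))
```

Running `env.close()` from a `finally` block is what makes the "exactly once" test in the row above meaningful: timeout and disconnect share a single cleanup path.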
 
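The `TRN 13` client described above hides two transports behind one sync surface; a rough sketch of that shape (the stub transport and all method bodies here are invented for illustration, not the real `ReplicaLabClient` internals):

```python
class _StubTransport:
    """Stand-in for the REST (httpx) or WebSocket (websocket-client) layer."""

    def __init__(self):
        self.closed = False

    def reset(self, **params):
        return {"episode_id": "ep-1", "observation": {"round": 0, **params}}

    def step(self, action):
        return {"reward": 0.0, "done": action.get("type") == "accept"}

    def close(self):
        self.closed = True


class ClientSketch:
    """Mirrors the unified connect/reset/step/close surface as a context manager."""

    def __init__(self, transport):
        self._transport = transport

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._transport.close()
        return False

    def reset(self, **params):
        return self._transport.reset(**params)

    def step(self, action):
        return self._transport.step(action)


transport = _StubTransport()
with ClientSketch(transport) as client:
    obs = client.reset(scenario="math_replication", difficulty="easy")
    result = client.step({"type": "accept"})
```

The context manager guarantees `close()` on exit, which is the same lifecycle guarantee the recovery entry relies on when sessions are replayed.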
docs/completion.md CHANGED
@@ -20,30 +20,32 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
  | Metric | Value |
  |--------|-------|
  | Total tasks | 152 |
- | Completed | 38 |
- | Partial / active | 10 |
- | Remaining | 104 |
- | **Completion rate** | **25.00%** |
 
  ### Completion by Person
 
  | Person | Assigned | Completed (own) | Completed (by others) | Remaining | Rate |
  |--------|----------|----------------|----------------------|-----------|------|
- | Kian (Person A) | 49 (47 solo + 2 shared with B) | 1 shared sign-off (`FND 08`) | 20 (`FND 04`, `FND 09`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`, `MOD 11`, `MOD 12`, `SCN 01` to `SCN 10`, `AGT 05` done by Person B) | 28 | 42.86% |
- | Person B (Ayush) | 29 (27 solo + 2 shared with A) | 10 (`FND 08`, `MOD 09`, `SCN 11`, `AGT 01`, `AGT 02`, `AGT 04`, `AGT 05`, `AGT 06`, `AGT 07`, `AGT 11`) | 0 | 19 | 34.48% |
- | Max (Person C) | 41 | 1 (`FND 11`) | 7 (`FND 01`, `FND 02`, `FND 03`, `FND 05`, `FND 07`, `FND 10`, `FND 12` done by others) | 33 | 19.51% |
  | Kush (Person D) | 32 | 0 | 1 (`FND 06` done by Person B) | 31 | 3.13% |
  | All (shared) | 3 | 2 (`FND 08`, `AGT 05`) | 0 | 1 | 66.67% |
 
  Note: Person B (Ayush) has completed two shared tasks in their own lane
- (`FND 08`, `AGT 05`) plus eight solo tasks in their own lane (`MOD 09`,
- `SCN 11`, `AGT 01`, `AGT 02`, `AGT 04`, `AGT 06`, `AGT 07`, `AGT 11`), and has also executed twenty-five tasks outside their assigned
- ownership (`FND 01`, `FND 02`, `FND 04`, `FND 05`, `FND 06`, `FND 07`,
  `FND 09`, `FND 10`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`,
- `MOD 11`, `MOD 12`, `SCN 01` to `SCN 10`) to keep the Kian, Max, and Kush
- dependency chain moving. Ayush now has one fully unblocked implementation
- task available: `AGT 03`, with `AGT 10` reduced to a single remaining
- external dependency on `JDG 06`.
  ---
 
@@ -51,14 +53,7 @@ external dependency on `JDG 06`.

  | ID | Assigned To | Current Status | Remaining Acceptance Item |
  |----|-------------|----------------|---------------------------|
- | API 01 | Max (Person C) | FastAPI app shell and `/health` endpoint work locally against the stub env | Real env dependency and task-owner sign-off |
- | API 02 | Max (Person C) | `/reset` works locally against the stub env and now seeds normalized math / ML / finance scenarios through the shared generator | Real env reset dependency and task-owner sign-off |
- | API 03 | Max (Person C) | `/step` works locally against the stub env | Real env step dependency and task-owner sign-off |
- | API 04 | Max (Person C) | `/scenarios` returns the normalized scenario-family list from the shared generator | Real env exposure and task-owner sign-off |
- | API 06 | Max (Person C) | WebSocket reset, ping, and step work locally against the stub env, including normalized scenario-family resets | Real env integration and task-owner sign-off |
- | API 07 | Max (Person C) | Idle timeout and cleanup logic exist in the WebSocket path | Real env disconnect cleanup verification |
- | API 08 | Max (Person C) | `server/Dockerfile` exists | Local Docker build and run verification |
- | API 13 | Max (Person C) | CORS middleware exists for dev and hosted origins | Frontend integration verification |
  | API 14 | Max (Person C) | REST session isolation exists in the server stub path | Concurrent-session verification against the real env |
  | OBS 02 | Max (Person C) | Structured local logging exists in `server/app.py` | Logging behavior needs real-env usage confirmation |
 
@@ -95,6 +90,34 @@ external dependency on `JDG 06`.
  | SCN 08 | E03 | Person A | Implement hidden reference spec and allowed substitutions per template | `replicalab/scenarios/templates.py`, `tests/test_scenarios.py` | 2026-03-08 | Added per-template hidden reference specs and allowed substitutions so scoring and negotiation can distinguish fixed versus flexible elements deterministically. | Hidden reference clearly marks what is fixed versus flexible for deterministic scoring | Yes - verified with `python -m pytest tests/test_scenarios.py` |
  | SCN 09 | E03 | Person A | Implement `generate_scenario(seed, template, difficulty)` | `replicalab/scenarios/templates.py`, `server/app.py`, `tests/test_scenarios.py` | 2026-03-08 | Added deterministic full-scenario generation and wired the stub server to use the normalized scenario families instead of the earlier hard-coded lab-only placeholder list. | Function returns a full scenario with deterministic content | Yes - verified with `python -m pytest tests/test_scenarios.py` and a `_StubEnv.reset(...)` smoke test |
  | SCN 10 | E03 | Person A | Add seeded generation tests and consistency tests | `tests/test_scenarios.py` | 2026-03-08 | Added seeded determinism, variation, difficulty, consistency, and family-list tests for the normalized scenario engine. | Same seed plus template returns the same scenario and different seeds vary | Yes - verified with `python -m pytest tests/test_scenarios.py` |
  ### Person B (Ayush) - Completed own tasks
 
@@ -104,11 +127,14 @@ external dependency on `JDG 06`.
  | SCN 11 | E03 | Create hand checked golden scenarios for prompt testing | `tests/fixtures/golden_scenarios.json`, `tests/test_scenarios.py` | 2026-03-08 | Added three deterministic golden scenarios for math, ML, and finance prompt checks plus fixture-validation tests. | Three fixed scenarios are available for deterministic manual testing | Yes - verified with `python -m pytest tests/test_scenarios.py` |
  | AGT 01 | E04 | Draft domain-neutral system prompt for Scientist role from normalized scenario data | `replicalab/agents/scientist_policy.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `build_scientist_system_prompt(...)` to render role guidance, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON contract from normalized scenario data. | Prompt clearly explains role, mapped constraints, and JSON output contract | Yes - verified with `python -m pytest tests/test_scientist_policy.py` and a direct prompt-build smoke check |
  | AGT 02 | E04 | Build observation to prompt formatting helper from normalized scenario-derived observations | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `format_scientist_observation(...)` to render round status, paper context, conversation history, current protocol, and the next-action instruction in a fixed deterministic order, and exported it through the agent package. | Formatted prompt includes task info, history, and action schema consistently | Yes - verified with `python -m pytest tests/test_scientist_policy.py` |
- | AGT 04 | E04 | Build baseline heuristic Scientist for non trained smoke tests | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `build_baseline_scientist_action(...)`, a deterministic non-LLM Scientist policy that proposes a protocol on the first turn, revises only when the latest Lab Manager feedback contains an obvious blocker, and otherwise accepts the current protocol so smoke episodes can finish cleanly. | Baseline can complete episodes without crashing | Yes - verified with `python -m pytest tests/test_scientist_policy.py` including a stub-env episode smoke test |
  | AGT 05 | E04 | Implement deterministic feasibility checker over normalized constraints and resources | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added a deterministic Lab Manager feasibility checker with a typed `FeasibilityCheckResult`, explicit per-dimension protocol, budget, equipment, reagents, schedule, staff, and policy checks, substitution reporting, and stable summary output. | Checker returns clear pass or fail per constraint dimension | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py tests/test_validation.py tests/test_scientist_policy.py` |
  | AGT 06 | E04 | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added deterministic alternative-suggestion logic that applies substitutions, duration clamps, and sample-size reductions in fixed order, re-runs feasibility after the revision, and returns a typed `AlternativeSuggestion` with applied changes, remaining failures, and pre or post feasibility checks. | Lab Manager can suggest at least one sensible revision when the initial plan fails | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py` |
  | AGT 07 | E04 | Add grounded Lab Manager response synthesis from feasibility results and suggested revisions | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `server/app.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added `compose_lab_manager_response(...)`, a deterministic outward-action composer that converts feasibility plus alternative-suggestion results into a typed `LabManagerAction` with stable flags, readable explanations, and optional injected explanation rendering, then wired the stub server to log those grounded responses instead of placeholder text. | Output is readable, grounded in checker results, and maps cleanly to underlying checks | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py` and a stub-env step smoke check |
  | AGT 11 | E04 | Select and document base model for Scientist training | `docs/agt11_scientist_model_selection.md`, `README.md` | 2026-03-08 | Recorded `Qwen3-4B` as the primary Scientist training model with `Qwen3-8B` as the H100-only stretch fallback, and surfaced the decision in the README so the training path uses one canonical model choice. | Decision is recorded and all team members know which model will be fine tuned | Yes - verified by the decision record and README update |
  ### Kush (Person D) - Completed on behalf of others
 
@@ -179,6 +205,28 @@ external dependency on `JDG 06`.
  | AGT 06 | No new formal dependency edge by itself, but `AGT 07` now has deterministic revision content to narrate and compare against |
  | AGT 07 | `AGT 10` now only waits on `JDG 06`, and the stub server now emits grounded Lab Manager responses instead of placeholder review text |
  | AGT 11 | No new formal dependency edge by itself, but the Scientist training model choice is now fixed across repo docs |
  ### Current Unblocked and Active Tasks
 
@@ -186,15 +234,20 @@ external dependency on `JDG 06`.
  |----|-------|------|-------------|
  | FND 13 | Kush (Person D) | Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles | FND 03 |
  | UI 01 | Kush (Person D) | Create application shell with three panel layout | FND 03 |
- | AGT 03 | Person B (Ayush) | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 |
  | MOD 06 | Kian (Person A) | Add semantic validators for impossible plans such as zero sample size with positive controls | MOD 05 |
  | MOD 07 | Max (Person C) | Add state serialization helper for replay logs | MOD 04 |
- | JDG 01 | Kian (Person A) | Implement rigor or objective-validity score for plan completeness, required checks, method quality, and justification | SCN 08 |
- | JDG 02 | Kian (Person A) | Implement feasibility score for budget, resources, time, staffing, compute, and bookings | SCN 07, AGT 05 |
- | JDG 03 | Kian (Person A) | Implement fidelity score against hidden reference spec, required steps, and allowed substitutions | SCN 08 |
  | SCN 13 | Kian (Person A) | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time slot conflicts and duration | SCN 07 |
- | ENV 01 | Kian (Person A) | Create `ReplicaLabEnv` class skeleton | MOD 04, SCN 09 |
  | DOC 01 | Kush (Person D) | Write hook, problem statement, and one line product summary | FND 06 |
  ---
 
@@ -205,12 +258,12 @@ external dependency on `JDG 06`.
  | E01. Foundations and repository setup | 13 | 12 | 92.31% |
  | E02. Domain models, validation, state contracts | 12 | 8 | 66.67% |
  | E03. Scenario engine and constraint generation | 13 | 11 | 84.62% |
- | E04. Scientist agent and Lab Manager policy | 11 | 7 | 63.64% |
- | E05. Judge engine and reward logic | 11 | 0 | 0% |
- | E06. OpenEnv environment implementation | 11 | 0 | 0% |
- | E07. API, server, Docker, deployment | 19 | 0 | 0% |
- | E08. RL training pipeline and evaluation | 15 | 0 | 0% |
  | E09. Frontend, UX, replay, demo views | 15 | 0 | 0% |
  | E10. Logging, replay, and observability | 9 | 0 | 0% |
- | E11. Testing and quality gates | 12 | 0 | 0% |
  | E12. README, demo video, submission packaging | 11 | 0 | 0% |
 
  | Metric | Value |
  |--------|-------|
  | Total tasks | 152 |
+ | Completed | 67 |
+ | Partial / active | 3 |
+ | Remaining | 82 |
+ | **Completion rate** | **44.08%** |
 
  ### Completion by Person
 
  | Person | Assigned | Completed (own) | Completed (by others) | Remaining | Rate |
  |--------|----------|----------------|----------------------|-----------|------|
+ | Kian (Person A) | 49 (47 solo + 2 shared with B) | 1 shared sign-off (`FND 08`) | 38 (`FND 04`, `FND 09`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`, `MOD 11`, `MOD 12`, `SCN 01` to `SCN 10`, `AGT 05`, `ENV 01` to `ENV 08`, `JDG 01` to `JDG 05`, `TST 01` to `TST 05` done by Person B) | 10 | 79.59% |
+ | Person B (Ayush) | 29 (27 solo + 2 shared with A) | 13 (`FND 08`, `MOD 09`, `SCN 11`, `AGT 01`, `AGT 02`, `AGT 03`, `AGT 04`, `AGT 05`, `AGT 06`, `AGT 07`, `AGT 08`, `AGT 11`, `TRN 13`) | 0 | 16 | 44.83% |
+ | Max (Person C) | 41 | 1 (`FND 11`) | 17 (`FND 01`, `FND 02`, `FND 03`, `FND 05`, `FND 07`, `FND 10`, `FND 12` done by others, `API 02`, `API 03`, `API 04`, `API 06`, `API 07`, `API 08`, `API 09`, `API 13`, `API 15`, `TST 07` done by Person B) | 23 | 43.90% |
  | Kush (Person D) | 32 | 0 | 1 (`FND 06` done by Person B) | 31 | 3.13% |
  | All (shared) | 3 | 2 (`FND 08`, `AGT 05`) | 0 | 1 | 66.67% |
 
  Note: Person B (Ayush) has completed two shared tasks in their own lane
+ (`FND 08`, `AGT 05`) plus eleven solo tasks in their own lane (`MOD 09`,
+ `SCN 11`, `AGT 01`, `AGT 02`, `AGT 03`, `AGT 04`, `AGT 06`, `AGT 07`,
+ `AGT 08`, `AGT 11`, `TRN 13`), and has also executed a large cross-owner
+ bundle (`FND 01`, `FND 02`, `FND 04`, `FND 05`, `FND 06`, `FND 07`,
  `FND 09`, `FND 10`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`,
+ `MOD 11`, `MOD 12`, `SCN 01` to `SCN 10`, `ENV 01` to `ENV 08`, `JDG 01`
+ to `JDG 05`, `TST 01` to `TST 05`, `API 02`, `API 03`, `API 04`, `API 06`,
+ `API 07`, `API 08`, `API 09`, `API 13`, `API 15`, `TST 07`) to keep the
+ Kian, Max, and Kush dependency chain moving.
+ `TRN 13` is complete; `TRN 03` is the next unblocked task for Person B.
 
  ---
  | ID | Assigned To | Current Status | Remaining Acceptance Item |
  |----|-------------|----------------|---------------------------|
+ | API 01 | Max (Person C) | FastAPI app shell and `/health` endpoint work locally against the real env | Task-owner sign-off and final deployment-path polish |
  | API 14 | Max (Person C) | REST session isolation exists in the server stub path | Concurrent-session verification against the real env |
  | OBS 02 | Max (Person C) | Structured local logging exists in `server/app.py` | Logging behavior needs real-env usage confirmation |
  | SCN 08 | E03 | Person A | Implement hidden reference spec and allowed substitutions per template | `replicalab/scenarios/templates.py`, `tests/test_scenarios.py` | 2026-03-08 | Added per-template hidden reference specs and allowed substitutions so scoring and negotiation can distinguish fixed versus flexible elements deterministically. | Hidden reference clearly marks what is fixed versus flexible for deterministic scoring | Yes - verified with `python -m pytest tests/test_scenarios.py` |
  | SCN 09 | E03 | Person A | Implement `generate_scenario(seed, template, difficulty)` | `replicalab/scenarios/templates.py`, `server/app.py`, `tests/test_scenarios.py` | 2026-03-08 | Added deterministic full-scenario generation and wired the stub server to use the normalized scenario families instead of the earlier hard-coded lab-only placeholder list. | Function returns a full scenario with deterministic content | Yes - verified with `python -m pytest tests/test_scenarios.py` and a `_StubEnv.reset(...)` smoke test |
  | SCN 10 | E03 | Person A | Add seeded generation tests and consistency tests | `tests/test_scenarios.py` | 2026-03-08 | Added seeded determinism, variation, difficulty, consistency, and family-list tests for the normalized scenario engine. | Same seed plus template returns the same scenario and different seeds vary | Yes - verified with `python -m pytest tests/test_scenarios.py` |
+ | ENV 01 | E06 | Person A | Create `ReplicaLabEnv` class skeleton | `replicalab/env/replicalab_env.py`, `replicalab/env/__init__.py` | 2026-03-08 | Added a real `ReplicaLabEnv` module as a drop-in replacement for the former in-server stub, ported the working stub behavior into the environment package, wired scenario-pack-backed reset or step or state or close methods with follow-on `TODO(ENV XX)` markers, and removed the old stub-only marker from `StepInfo` payloads. | Environment class imports and instantiates without runtime errors | Yes - verified with a direct `ReplicaLabEnv.reset(...) -> step(...) -> state() -> close()` smoke run and `python -m pytest` (`111 passed`) |
+ | JDG 01 | E05 | Person A | Implement rigor or objective-validity score | `replicalab/scoring/rigor.py`, `replicalab/utils/text.py`, `tests/test_reward.py` | 2026-03-08 | Added `score_rigor(protocol, scenario)` with weighted sub-scores for structural completeness (0.30), success criteria coverage (0.40), and required element coverage (0.30). Uses shared `element_tokens` from `replicalab/utils/text.py`. Five focused tests in `test_reward.py` cover quality ordering, determinism, controls impact, rationale length, and all-domain range validation. | Score is between 0 and 1, matches rubric examples, and rewards correct evidence-backed planning | Yes - verified with `python -m pytest tests/test_reward.py` (18 tests pass) |
+ | JDG 02 | E05 | Person A | Implement feasibility score | `replicalab/scoring/feasibility.py`, `tests/test_reward.py` | 2026-03-08 | Added `score_feasibility(protocol, scenario, check=None)` that derives a continuous [0,1] signal from `FeasibilityCheckResult` (AGT 05). Seven dimensions weighted equally (1/7) with partial credit for budget, equipment, reagents, and staff. Accepts optional pre-computed check to avoid redundant work. Six focused tests cover viable protocol, infeasible ordering, pre-computed check equivalence, determinism, partial credit, and all-domain range. | Score is between 0 and 1 and matches normalized constraint logic | Yes - verified with `python -m pytest tests/test_reward.py` (18 tests pass) |
+ | JDG 03 | E05 | Person A | Implement fidelity score | `replicalab/scoring/fidelity.py`, `tests/test_reward.py` | 2026-03-08 | Added `score_fidelity(protocol, scenario)` with substitution-aware scoring: required element coverage (0.50, direct match=1.0, substitution=0.7), flexible element alignment (0.20, bonus only), target metric alignment (0.20), and technique appropriateness (0.10). Five focused tests cover aligned vs misaligned ordering, determinism, substitution partial credit, target metric impact, and all-domain range. | Score is between 0 and 1 and matches rubric examples for plan and evidence alignment | Yes - verified with `python -m pytest tests/test_reward.py` (18 tests pass) |
+ | JDG 04 | E05 | Person A | Implement total reward formula | `replicalab/scoring/rubric.py`, `tests/test_reward.py` | 2026-03-07 | `compute_total_reward(breakdown)` implements `10 × rigor × feasibility × fidelity + bonuses − penalties` with `max(0.0, ...)` floor clamp. Eight new tests in `test_reward.py` verify perfect-vs-broken ordering, zero-feasibility behavior, efficiency bonus ordering, exact penalty subtraction, zero-clamp floor, determinism, external penalties injection, and default-empty penalties. Seven existing rubric tests in `test_env.py` also cover the formula. | Total reward formula matches agreed math, clamps at zero, and returns consistent output for plan quality and bounded tool behavior | Yes - verified with `python -m pytest tests/test_reward.py` (26 tests pass) and `python -m pytest tests/test_env.py` (36 tests pass) |
+ | JDG 05 | E05 | Person A | Build reward breakdown object | `replicalab/scoring/rubric.py`, `replicalab/scoring/__init__.py`, `tests/test_reward.py` | 2026-03-07 | `build_reward_breakdown(...)` accepts an optional `penalties: dict[str, float]` parameter for named penalty keys (e.g. `invalid_tool_use`, `unsupported_claim`) from bounded-tool diagnostics without reopening the model contract. Returns a typed `RewardBreakdown` with rigor, feasibility, fidelity, efficiency_bonus, communication_bonus, and penalties dict. Exported through `replicalab.scoring`. | Breakdown includes rigor, feasibility, fidelity, bonuses, penalties, and bounded tool diagnostics extension point | Yes - verified with `python -m pytest tests/test_reward.py` (26 tests pass) and `python -m pytest tests/test_env.py` (36 tests pass) |
+ | ENV 02 | E06 | Person A | Implement real reset wiring | `replicalab/env/replicalab_env.py` | 2026-03-08 | `_make_observation()` now uses the scenario pack as source of truth for booked/out-of-stock/safety data instead of empty placeholders. Eight reset tests verify both roles populated, booked/out-of-stock preserved, all templates and difficulties. | Reset returns initial observations with full scenario data | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | ENV 03 | E06 | Person A | Implement Scientist turn with validation | `replicalab/env/replicalab_env.py` | 2026-03-08 | Added `_validate_scientist_action()` that runs `validate_protocol()` on proposals and returns structured error strings without crashing the env. Invalid actions don't advance the round. | Valid action updates state, invalid action returns structured error | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | ENV 04 | E06 | Person A | Implement Lab Manager response step | `replicalab/env/replicalab_env.py` | 2026-03-08 | `_lab_manager_action()` uses the full grounded pipeline: `check_feasibility()` → `suggest_alternative()` → `compose_lab_manager_response()`. | Lab Manager response is grounded in feasibility check results | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | ENV 05 | E06 | Person A | Centralize termination logic | `replicalab/env/replicalab_env.py` | 2026-03-08 | Added `_check_termination()`: Scientist accept with existing protocol OR max_rounds. Lab Manager accept does NOT auto-terminate. | Episode terminates on agreement or round limit | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | ENV 06 | E06 | Person A | Wire real judge scoring | `replicalab/env/replicalab_env.py`, `tests/test_env.py` | 2026-03-07 | Terminal accept steps call `build_reward_breakdown()` and `compute_total_reward()` with real rigor/feasibility/fidelity scores stored in `EpisodeState`. Terminal-without-agreement path now distinguishes `timeout` (max rounds) from `no_agreement` verdict. Four new tests in `TestEnvReward` verify agreement-terminal breakdown/notes/verdict, no-agreement determinism, timeout verdict, and state-stored component scores. | Final step returns total reward, breakdown info, and deterministic penalties or bonuses; verdict distinguishes timeout from no_agreement | Yes - verified with `python -m pytest tests/test_env.py` (36 tests pass) and `python -m pytest` (178 tests pass) |
+ | ENV 07 | E06 | Person A | Implement state() deep snapshot | `replicalab/env/replicalab_env.py` | 2026-03-08 | `state()` now returns `self._state.model_copy(deep=True)` so callers get an independent snapshot. Two tests verify mutation isolation. | State snapshot is independent of env internals | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | ENV 08 | E06 | Person A | Implement close() with lifecycle guard | `replicalab/env/replicalab_env.py` | 2026-03-08 | Added `_closed` flag, idempotent `close()`, `_ensure_open()` guard on `step()`, and `reset()` reopens a closed env. Three tests verify idempotency, step-after-close raises, and reset-reopens. | Close frees resources and does not throw; step after close raises | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | TST 01 | E11 | Person A | Add reset returns valid observations test | `tests/test_env.py` | 2026-03-08 | Eight tests in `TestReset` class covering both roles populated, scientist fields, lab manager fields, booked/out-of-stock preservation, state round zero, episode ID, clearing previous episode, and all templates/difficulties. | Test confirms both roles receive valid structured observations | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | TST 02 | E11 | Person A | Add valid action step test | `tests/test_env.py` | 2026-03-08 | Eight tests in `TestStep` class covering round advancement, observation shape, conversation history, accept termination, real reward scores, max round termination, step info fields, and full propose-then-accept episode. | Valid action advances round and returns correct shape | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | TST 03 | E11 | Person A | Add invalid action handling test | `tests/test_env.py` | 2026-03-08 | Four tests in `TestInvalidAction` class covering error string on invalid duration, env survival after error, no round advancement on invalid action, and request_info always passes. | Invalid action yields structured error and env survives | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | TST 04 | E11 | Person A | Add perfect protocol high reward test | `tests/test_reward.py` | 2026-03-08 | Added reward-regression coverage proving a fully aligned protocol scores higher than a broken baseline and stays ordered consistently across reruns. | Perfect protocol scores higher than baseline and broken protocol | Yes - verified with `python -m pytest tests/test_reward.py` (26 tests pass) |
+ | TST 05 | E11 | Person A | Add zero dimension or penalty behavior test | `tests/test_reward.py` | 2026-03-08 | Added reward-regression coverage for zero-feasibility collapse, exact penalty subtraction, and zero-floor clamp behavior so timeout and penalty paths lower reward deterministically. | Zero feasibility or timeout lowers reward as expected | Yes - verified with `python -m pytest tests/test_reward.py` (26 tests pass) |
+ | API 03 | E07 | Person C | Add `POST /step` endpoint | `server/app.py`, `tests/test_server.py` | 2026-03-07 | Fixed `_build_episode_log()` to take the real `StepResult` instead of rebuilding reward data from state with stale stub values. Both REST `/step` and WebSocket step handler now pass the terminal `StepResult` to the updated helper so replay logs use real `reward_breakdown`, `judge_notes`, and `verdict` (including `timeout` vs `no_agreement`). Added five endpoint tests covering reset-then-step happy path, invalid session ID 404, terminal step with real reward breakdown, semantic invalid action returning 200 with `info.error`, and replay with real judge data. | Step endpoint accepts valid action and returns step result | Yes - verified with `python -m pytest tests/test_server.py` (10 tests pass) and `python -m pytest` (183 tests pass) |
+ | API 06 | E07 | Person C | Add WebSocket session handler with isolated env per connection | `server/app.py`, `tests/test_server.py` | 2026-03-07 | WebSocket handler at `/ws` supports `reset`, `step`, and `ping` message types with per-connection env isolation, idle timeout, and replay storage on terminal episodes. Twelve WebSocket tests cover ping-pong, reset observation, step result, full episode real reward, invalid JSON, missing action field, invalid action payload, unknown message type, session isolation, semantic invalid action returning `step_ok` with `info.error`, timeout verdict proving real-env integration, and terminal episode replay persistence via `GET /replay/{episode_id}`. | WebSocket session handler supports reset, step, ping with isolated env per connection and correct replay storage | Yes - verified with `python -m pytest tests/test_server.py` (22 tests pass) and `python -m pytest` (195 tests pass) |
113
+ | TST 07 | E11 | Person C | Add WebSocket session handler tests | `tests/test_server.py` | 2026-03-07 | Twelve focused WebSocket tests covering connectivity, message handling, error paths, session isolation, semantic-vs-transport error distinction, timeout verdict, and replay log persistence with real judge data. Tests verify that structurally valid but semantically invalid actions return `step_ok` with `info.error` (not WS error frames), matching the env contract. | WebSocket tests cover happy path, error handling, session isolation, and real-env integration | Yes - verified with `python -m pytest tests/test_server.py` (22 tests pass) |
114
+ | API 02 | E07 | Person C | Add `POST /reset` endpoint | `server/app.py`, `tests/test_server.py` | 2026-03-08 | `/reset` endpoint creates a new env (or closes the prior one when reusing `session_id`), calls `env.reset(...)`, persists env, `last_active`, and `episode_id` in the in-memory REST session store, and returns `session_id`, `episode_id`, `observation`. Seven dedicated tests cover response shape, both-role observation, explicit session_id reuse, prior-env close on reuse, default params, all scenario/difficulty combos, and seed determinism. | Reset endpoint starts a new episode and returns initial observation | Yes - verified with `python -m pytest tests/test_server.py` (29 tests pass) and `python -m pytest` (202 tests pass) |
115
+ | API 04 | E07 | Person C | Add `GET /scenarios` endpoint | `server/app.py`, `tests/test_server.py` | 2026-03-08 | `GET /scenarios` returns the `available_scenario_families()` output through the typed `ScenariosResponse` model. Five focused tests cover status code, response shape, all three scenario families, the expected `easy`, `medium`, and `hard` difficulties, and the absence of extra keys. | Endpoint lists available scenario families and difficulties | Yes - verified with `python -m pytest tests/test_server.py -v` (34 tests pass) |
116
+ | API 07 | E07 | Person C | Add idle timeout and graceful disconnect cleanup | `server/app.py`, `tests/test_server.py` | 2026-03-08 | Verified the existing WebSocket idle-timeout and disconnect cleanup path with two focused tests: one monkeypatches the idle timeout to 0.5s and confirms the server closes with code 1000 when no message arrives, and one wraps `_make_env()` to confirm `env.close()` is called exactly once from the `finally` block on disconnect. | Stale connections close cleanly and the environment closes without leak | Yes - verified with `python -m pytest tests/test_server.py -v` (34 tests pass) |
117
+ | API 13 | E07 | Person C | Add CORS middleware configuration for frontend origins in dev and production | `server/app.py`, `tests/test_server.py` | 2026-03-08 | Confirmed the existing FastAPI CORS middleware allows the local Vite frontend origin plus `https://*.hf.space`, and added three explicit preflight tests covering localhost allowance, HF Space allowance, and disallowed-origin rejection. | Frontend on localhost:5173 and HF Space origin can reach the API without CORS errors | Yes - verified with `python -m pytest tests/test_server.py -v` (34 tests pass) |
118
+ | API 08 | E07 | Person C | Build Dockerfile with Python app startup on port 7860 | `server/Dockerfile`, `Dockerfile`, `server/requirements.txt`, `docs/max/deployment.md` | 2026-03-08 | Fixed editable install (`-e .` → `. --no-deps`) in both `server/Dockerfile` and root `Dockerfile`, added `httpx` and `websocket-client` to `server/requirements.txt` (required by `replicalab.client`), rebuilt without cache. Verified Docker container starts with the **real env** (`"env":"real"`), and all four endpoints work: `GET /health`, `GET /scenarios`, `POST /reset`, `POST /step`. Added verified endpoint commands to `docs/max/deployment.md`. | Local Docker run serves app on port 7860 | Yes - verified with `docker build -f server/Dockerfile -t replicalab . && docker run -p 7860:7860 replicalab` and curl against all four endpoints |
119
+ | API 09 | E07 | Person C | Add Hugging Face Space metadata and deploy instructions | `README.md`, `Dockerfile`, `docs/max/deployment.md` | 2026-03-08 | Added the Hugging Face Spaces YAML frontmatter to the root README, created the root-level `Dockerfile` required by the Docker SDK, and documented Space creation, git remote setup, push, logs, and secret management in `docs/max/deployment.md`. | Space config is valid for Docker app deployment | Yes - verified against HF Spaces Docker deployment requirements |
120
+ | API 15 | E07 | Person C | Create HF Space README.md with YAML frontmatter | `README.md` | 2026-03-08 | Added the required Spaces frontmatter fields (`sdk: docker`, `app_port: 7860`, title, emoji, colors, pinned) to the root README so Hugging Face parses the Space metadata correctly on push. | HF Space config is valid and Space launches correctly from the metadata | Yes - verified against the HF Spaces frontmatter schema |
121
 
122
  ### Person B (Ayush) - Completed own tasks
123
 
 
127
  | SCN 11 | E03 | Create hand checked golden scenarios for prompt testing | `tests/fixtures/golden_scenarios.json`, `tests/test_scenarios.py` | 2026-03-08 | Added three deterministic golden scenarios for math, ML, and finance prompt checks plus fixture-validation tests. | Three fixed scenarios are available for deterministic manual testing | Yes - verified with `python -m pytest tests/test_scenarios.py` |
128
  | AGT 01 | E04 | Draft domain-neutral system prompt for Scientist role from normalized scenario data | `replicalab/agents/scientist_policy.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `build_scientist_system_prompt(...)` to render role guidance, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON contract from normalized scenario data. | Prompt clearly explains role, mapped constraints, and JSON output contract | Yes - verified with `python -m pytest tests/test_scientist_policy.py` and a direct prompt-build smoke check |
129
  | AGT 02 | E04 | Build observation to prompt formatting helper from normalized scenario-derived observations | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `format_scientist_observation(...)` to render round status, paper context, conversation history, current protocol, and the next-action instruction in a fixed deterministic order, and exported it through the agent package. | Formatted prompt includes task info, history, and action schema consistently | Yes - verified with `python -m pytest tests/test_scientist_policy.py` |
130
+ | AGT 04 | E04 | Build baseline heuristic Scientist for non trained smoke tests | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `build_baseline_scientist_action(...)`, a deterministic baseline Scientist policy that proposes a protocol on the first turn, revises only when the latest Lab Manager feedback contains an obvious blocker, and otherwise accepts the current protocol so smoke episodes can finish cleanly. | Baseline can complete episodes without crashing | Yes - verified with `python -m pytest tests/test_scientist_policy.py` including a stub-env episode smoke test |
131
  | AGT 05 | E04 | Implement deterministic feasibility checker over normalized constraints and resources | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added a deterministic Lab Manager feasibility checker with a typed `FeasibilityCheckResult`, explicit per-dimension protocol, budget, equipment, reagents, schedule, staff, and policy checks, substitution reporting, and stable summary output. | Checker returns clear pass or fail per constraint dimension | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py tests/test_validation.py tests/test_scientist_policy.py` |
132
  | AGT 06 | E04 | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added deterministic alternative-suggestion logic that applies substitutions, duration clamps, and sample-size reductions in fixed order, re-runs feasibility after the revision, and returns a typed `AlternativeSuggestion` with applied changes, remaining failures, and pre or post feasibility checks. | Lab Manager can suggest at least one sensible revision when the initial plan fails | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py` |
133
  | AGT 07 | E04 | Add grounded Lab Manager response synthesis from feasibility results and suggested revisions | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `server/app.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added `compose_lab_manager_response(...)`, a deterministic outward-action composer that converts feasibility plus alternative-suggestion results into a typed `LabManagerAction` with stable flags, readable explanations, and optional injected explanation rendering, then wired the stub server to log those grounded responses instead of placeholder text. | Output is readable, grounded in checker results, and maps cleanly to underlying checks | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py` and a stub-env step smoke check |
134
 | AGT 11 | E04 | Select and document base model for Scientist training | `docs/agt11_scientist_model_selection.md`, `README.md` | 2026-03-08 | Recorded `Qwen3-4B` as the primary Scientist training model with `Qwen3-8B` as the H100-only stretch fallback, and surfaced the decision in the README so the training path uses one canonical model choice. | Decision is recorded and all team members know which model will be fine-tuned | Yes - verified by the decision record and README update |
135
+ | AGT 03 | E04 | Add parse plus retry strategy for malformed model output | `replicalab/agents/scientist_policy.py`, `tests/test_scientist_policy.py` | 2026-03-07 | Added `call_scientist_with_retry(...)` with error-specific correction prompts, bounded retry loop, and exposed `RetryMetadata` telemetry. Seven focused tests cover first-try success, malformed-then-valid, invalid-then-valid, exhaustion, correction message content, and metadata serialization. | Malformed output triggers at least one controlled retry or explicit failure | Yes - verified with `python -m pytest tests/test_scientist_policy.py` (7 retry tests pass) |
136
+ | AGT 08 | E04 | Add prompt formatting, parse, and bounded-tool policy tests for Scientist policy | `replicalab/agents/scientist_policy.py`, `tests/test_scientist_policy.py` | 2026-03-07 | Added bounded-tool policy block to `build_scientist_system_prompt(...)` naming `search_evidence`, `run_code_check`, and `inspect_image` with explicit rules. Added 24 new tests covering parser happy paths (propose, accept, prose-wrapped), parser edge cases (empty, whitespace, list, extra keys, `to_dict()`), system prompt across all 3 domains plus dict coercion, bounded-tool policy assertions across all domains, role-boundary and output-contract assertions, formatter edge cases (final round, empty-list protocol), and baseline domain inference and forced-accept behavior. | Tests cover happy path, malformed output handling, and stable tool-policy reminders | Yes - verified with `python -m pytest tests/test_scientist_policy.py` (46 tests pass) and `python -m pytest tests/` (111 tests pass) |
137
+ | TRN 13 | E08 | Create reusable environment client module | `replicalab/client.py`, `tests/test_client.py` | 2026-03-08 | Added `ReplicaLabClient` with dual transport support (REST via `httpx`, WebSocket via `websocket-client`), unified sync interface (`connect`, `reset`, `step`, `state`, `close`), context manager support, internal session ID tracking, typed returns mapped to Pydantic models, and constructor-level transport selection. Twenty-four tests cover both transports: connect, reset, step, full episode, replay, context manager, error paths, semantic invalid action handling, and constructor validation. | Client module can be imported by notebook and other consumers without duplicating connection logic | Yes - verified with `python -m pytest tests/test_client.py` (24 tests pass) and `python -m pytest` (231 tests pass) |
138
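The `ReplicaLabClient` row above (TRN 13) implies the rollout loop that TRN 03 will build on: reset, act until `done`, close. A runnable sketch of that loop, with a stub client standing in for the real class (the two-step terminal logic and the 5.0 reward are placeholders, not the real env's behavior):

```python
class DummyClient:
    """Stand-in with the reset/step/close surface described in TRN 13."""
    def __init__(self) -> None:
        self._turns = 0

    def reset(self, seed=None):
        self._turns = 0
        return {"round": 0}

    def step(self, action):
        self._turns += 1
        done = self._turns >= 2          # terminal after two steps in this stub
        reward = 5.0 if done else 0.0
        return {"round": self._turns}, reward, done

    def close(self) -> None:
        pass

def run_episode(client, policy, seed=None) -> float:
    """Drive one episode and return the total reward."""
    obs = client.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, done = client.step(policy(obs))
        total += reward
    client.close()
    return total
```

With the stub, `run_episode(DummyClient(), lambda obs: {"action_type": "accept"})` finishes in two steps and returns 5.0.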
 
139
  ### Kush (Person D) - Completed on behalf of others
140
 
 
205
  | AGT 06 | No new formal dependency edge by itself, but `AGT 07` now has deterministic revision content to narrate and compare against |
206
  | AGT 07 | `AGT 10` now only waits on `JDG 06`, and the stub server now emits grounded Lab Manager responses instead of placeholder review text |
207
  | AGT 11 | No new formal dependency edge by itself, but the Scientist training model choice is now fixed across repo docs |
208
+ | ENV 01 | ENV 02, ENV 08, and the real-environment import path that partial server tasks now depend on |
209
+ | JDG 01 | Together with JDG 02 and JDG 03, unblocks JDG 04 (total reward formula) |
210
+ | JDG 02 | Together with JDG 01 and JDG 03, unblocks JDG 04 (total reward formula) |
211
+ | JDG 03 | Together with JDG 01 and JDG 02, unblocks JDG 04 (total reward formula) |
212
+ | JDG 04 | JDG 05, JDG 08, TST 04, TST 05 |
213
+ | JDG 05 | JDG 06, JDG 07, JDG 09, JDG 10, JDG 11, ENV 06 |
214
+ | ENV 02 | ENV 03, ENV 07, ENV 10, TST 01, API 02 (partial → full) |
215
+ | ENV 03 | ENV 04, ENV 05, TST 02, TST 03 |
216
+ | ENV 04 | ENV 05, TST 02 |
217
+ | ENV 05 | ENV 06, TST 02 |
218
+ | ENV 06 | ENV 07, ENV 09, ENV 11, API 03 (partial → full), API 06 (partial → full), OBS 07 |
219
+ | API 06 | TRN 03, TRN 13 |
220
+ | API 09 | API 10, API 17 |
221
+ | TST 07 | No new dependencies |
222
+ | ENV 07 | ENV 10 (partial unblock) |
223
+ | ENV 08 | API 07 (partial → full) |
224
+ | TST 01 | No new dependencies |
225
+ | TST 02 | No new dependencies |
226
+ | TST 03 | No new dependencies |
227
+ | API 02 | API 14 (partial → closer to full), UI 06 |
228
+ | TRN 13 | TRN 03 now has both its dependencies met (API 06 + TRN 13) |
229
+ | API 08 | API 09, API 16, API 19 |
230
 
231
  ### Current Unblocked and Active Tasks
232
 
 
234
  |----|-------|------|-------------|
235
  | FND 13 | Kush (Person D) | Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles | FND 03 |
236
  | UI 01 | Kush (Person D) | Create application shell with three panel layout | FND 03 |
237
+ | AGT 09 | Kian (Person A) | Add deterministic feasibility checker tests for Lab Manager grounding | AGT 05 to AGT 07 |
238
  | MOD 06 | Kian (Person A) | Add semantic validators for impossible plans such as zero sample size with positive controls | MOD 05 |
239
  | MOD 07 | Max (Person C) | Add state serialization helper for replay logs | MOD 04 |
 
 
 
240
  | SCN 13 | Kian (Person A) | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time-slot conflicts and duration | SCN 07 |
 
241
  | DOC 01 | Kush (Person D) | Write hook, problem statement, and one line product summary | FND 06 |
242
+ | JDG 06 | Kian (Person A) | Add optional plain English explanation function from reward breakdown | JDG 05 |
243
+ | JDG 08 | Kian (Person A) | Add score determinism tests and edge case tests | JDG 01 to JDG 05 |
244
+ | ENV 10 | Kian (Person A) | Add reset, step, invalid action, timeout, and deterministic replay tests | ENV 02 to ENV 09 |
245
+ | JDG 09 | Kush (Person D) | Create mock score cards and language for frontend | JDG 05 |
246
+ | API 10 | Max (Person C) | Deploy live Space and verify health, reset, and step | API 09 |
247
+ | API 17 | Max (Person C) | Document secrets and API key management for hosted deployment and Colab | API 09 |
248
+ | TRN 03 | Person B (Ayush) | Implement env client wrapper for training rollouts | API 06, TRN 13 |
249
+
250
+ Note: Person B (Ayush) has `TRN 03` as the next unblocked task (`TRN 13` is now complete). `AGT 10` still waits on `JDG 06` (Person A). The remaining TRN chain waits on `API 10` (Person C) and judge tasks (Person A).
251
 
252
  ---
253
 
 
258
  | E01. Foundations and repository setup | 13 | 12 | 92.31% |
259
  | E02. Domain models, validation, state contracts | 12 | 8 | 66.67% |
260
  | E03. Scenario engine and constraint generation | 13 | 11 | 84.62% |
261
+ | E04. Scientist agent and Lab Manager policy | 11 | 9 | 81.82% |
262
+ | E05. Judge engine and reward logic | 11 | 5 | 45.45% |
263
+ | E06. OpenEnv environment implementation | 11 | 8 | 72.73% |
264
+ | E07. API, server, Docker, deployment | 19 | 9 | 47.37% |
265
+ | E08. RL training pipeline and evaluation | 15 | 1 | 6.67% |
266
  | E09. Frontend, UX, replay, demo views | 15 | 0 | 0% |
267
  | E10. Logging, replay, and observability | 9 | 0 | 0% |
268
+ | E11. Testing and quality gates | 12 | 6 | 50.00% |
269
  | E12. README, demo video, submission packaging | 11 | 0 | 0% |
docs/fnd08_frozen_json_contract.md CHANGED
@@ -16,6 +16,17 @@ This document freezes the JSON contract for the shared ReplicaLab data models so
16
  - Person C API payload examples
17
  - Person D frontend and replay mocks
18
 
 
 
 
 
 
 
 
 
 
 
 
19
  ## Global conventions
20
 
21
  - All JSON keys use `snake_case`.
 
16
  - Person C API payload examples
17
  - Person D frontend and replay mocks
18
 
19
+ ## Tool-Capability Addendum
20
+
21
+ The richer-capability MVP adds bounded search, code-check, and image-inspection
22
+ support below this frozen contract.
23
+
24
+ This addendum does **not** reopen the outward action schema from `FND 08`.
25
+ The final outward actions remain `ScientistAction` and `LabManagerAction`.
26
+ Bounded tool use will be represented through scenario or evidence metadata,
27
+ environment-side tool traces, and `StepResult.info` or replay payloads rather
28
+ than new outward action types for the MVP.
29
+
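As one illustration of the addendum, a bounded tool trace could ride along in `StepResult.info` without touching the outward action schema. Every key name below is an assumption for illustration; nothing here is frozen by `FND 08`:

```python
# Hypothetical trace shape carried outside the outward action types.
tool_trace = {
    "tool_calls": [
        {"tool": "search_evidence", "source": "frozen_evidence_pack", "hits": 2},
        {"tool": "run_code_check", "passed": True},
    ]
}
step_info = {"tool_trace": tool_trace}  # surfaced via StepResult.info or replay payloads
```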
30
  ## Global conventions
31
 
32
  - All JSON keys use `snake_case`.
docs/kian/task_breakdown.md CHANGED
@@ -6,28 +6,39 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
6
 
7
  ## Current status
8
 
9
- - `FND 04`, `FND 08`, `FND 09`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`, `MOD 11`, and `MOD 12` are complete
10
- - Shared `AGT 05` is now complete, so the deterministic feasibility layer exists for both the Lab Manager path and the later judge feasibility score
11
- - `SCN 01` to `SCN 10` are also complete, so the deterministic scenario layer now exists in code
12
- - The Kian lane no longer needs to start with scenario seeding or template scaffolding
13
- - The remaining high-leverage work is semantic edge-case validation, booking conflicts, judge logic, and the real environment
 
 
 
 
 
 
 
 
 
 
 
14
 
15
  ---
16
 
17
  ## Recommended execution order
18
 
19
- 1. `MOD 06` -- extend the new semantic validation layer to catch impossible edge cases early
20
- 2. `SCN 13` -- deepen the normalized scenario layer with booking and scheduling conflicts
21
- 3. `JDG 01`, `JDG 02`, and `JDG 03` -- start the deterministic reward components that are now unblocked
22
- 4. `JDG 04` and `JDG 05` -- complete the reward pipeline once the component scorers exist
23
- 5. `ENV 01` and `ENV 02` -- once typed state and core scoring pieces are in place, start the real OpenEnv environment path
24
 
25
  ---
26
 
27
  ## Why this order
28
 
29
  - `MOD 06` is the smallest remaining contract-hardening task and builds directly on the completed `MOD 05` validator.
30
- - `SCN 13` is the remaining scenario-layer depth task; it builds naturally on the completed normalized resource model.
31
- - `JDG 01` and `JDG 03` can start immediately because their only formal prerequisite, `SCN 08`, is already complete.
32
- - `JDG 02` is now also unblocked because the deterministic feasibility checker from `AGT 05` exists.
33
- - The environment path can now start from typed state and step-result contracts instead of loose dict-based placeholders.
 
6
 
7
  ## Current status
8
 
9
+ - `FND 04`, `FND 08`, `FND 09`, `MOD 01` to `MOD 05`, `MOD 11`, `MOD 12` are complete
10
+ - Shared `AGT 05` is now complete, so the deterministic feasibility layer exists for both the Lab Manager path and the judge feasibility score
11
+ - `SCN 01` to `SCN 10` are complete, so the deterministic scenario layer exists in code
12
+ - `ENV 01` to `ENV 08` are all complete: the full environment lifecycle (reset, step, validate, Lab Manager response, termination, judge scoring, state snapshot, close) works end-to-end
13
+ - `JDG 01` to `JDG 05` are complete — the full deterministic reward pipeline (rigor, feasibility, fidelity, total reward formula with floor clamp, breakdown builder with named penalty extension point) is wired and tested
14
+ - `TST 01` to `TST 05` are complete with 36 env tests and 26 reward tests passing
15
+ - The remaining high-leverage work is semantic edge-case validation, booking conflicts, judge explanation output, and environment test suite expansion
16
+
17
+ Bounded-tool scope note:
18
+
19
+ 1. Kian-owned scenario, judge, and environment tasks now need to support
20
+ bounded `search`, `code_check`, and `image_inspection` traces without
21
+ changing the outer action contract.
22
+ 2. Training reward must remain deterministic and must not depend on live web access.
23
+ 3. Frozen evidence packs are the default training-time source of tool inputs.
24
+ 4. Audio remains out of scope.
25
 
26
  ---
27
 
28
  ## Recommended execution order
29
 
30
+ 1. `MOD 06` -- extend the semantic validation layer to catch impossible edge cases early
31
+ 2. `SCN 13` -- deepen the normalized scenario layer with booking, scheduling, and evidence-pack support
32
+ 3. `JDG 06` -- add plain English explanation function from reward breakdown (unblocks AGT 10 for Ayush)
33
+ 4. `JDG 08` -- add score determinism tests and edge case tests
34
+ 5. `ENV 10` -- add comprehensive env tests (reset, step, invalid action, timeout, deterministic replay)
35
 
36
  ---
37
 
38
  ## Why this order
39
 
40
  - `MOD 06` is the smallest remaining contract-hardening task and builds directly on the completed `MOD 05` validator.
41
+ - `SCN 13` is the remaining scenario-layer depth task; it now also needs to carry booking-conflict and evidence-pack data in a deterministic way.
42
+ - `JDG 06` is the highest-leverage remaining judge task because it directly unblocks `AGT 10` (Ayush's prompt text files) and `JDG 11` (structured audit payload).
43
+ - `JDG 08` builds on the now-complete JDG 01-05 pipeline to add regression coverage for score ordering and edge cases.
44
+ - `ENV 10` builds on the complete ENV 01-08 lifecycle to add comprehensive environment test coverage.
docs/kian/task_list.md CHANGED
@@ -11,8 +11,10 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
11
  - Shared `AGT 05` is now complete through Ayush's implementation of the deterministic feasibility checker
12
  - `SCN 01` to `SCN 10` are now complete in the repo
13
  - The normalized scenario pack, seeded generation, difficulty scaling, and three initial domain families are already present
14
- - The next Kian-lane tasks are now `MOD 06`, `SCN 13`, `JDG 01`, `JDG 02`, `JDG 03`, and `ENV 01`
15
- - `MOD 05` and shared `AGT 05` now exist, so the judge and environment path can build on real scenario-grounded checks instead of placeholder rules
 
 
16
 
17
  ---
18
 
@@ -20,10 +22,9 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
20
 
21
  - [ ] **MOD 06** | Add semantic validators for impossible plans such as zero sample size with positive controls | 0.75h | Depends: MOD 05
22
  - [ ] **SCN 13** | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time-slot conflicts and duration | 1h | Depends: SCN 07
23
- - [ ] **JDG 01** | Implement rigor or objective-validity score for plan completeness, required checks, method quality, and justification | 1.25h | Depends: SCN 08
24
- - [ ] **JDG 02** | Implement feasibility score for budget, resources, time, staffing, compute, and bookings | 1.25h | Depends: SCN 07, AGT 05
25
- - [ ] **JDG 03** | Implement fidelity score against hidden reference spec, required steps, and allowed substitutions | 1h | Depends: SCN 08
26
- - [ ] **ENV 01** | Create `ReplicaLabEnv` class skeleton | 0.5h | Depends: MOD 04, SCN 09
27
 
28
  ---
29
 
@@ -50,3 +51,21 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
50
  - [x] **SCN 08** | Completed by Person B (Ayush)
51
  - [x] **SCN 09** | Completed by Person B (Ayush)
52
  - [x] **SCN 10** | Completed by Person B (Ayush)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
11
  - Shared `AGT 05` is now complete through Ayush's implementation of the deterministic feasibility checker
12
  - `SCN 01` to `SCN 10` are now complete in the repo
13
  - The normalized scenario pack, seeded generation, difficulty scaling, and three initial domain families are already present
14
+ - `ENV 01` to `ENV 08` are now complete, so the full environment lifecycle (reset, step, validate, Lab Manager response, termination, judge scoring, state snapshot, close) works end-to-end
15
+ - `JDG 01` to `JDG 05` are now complete, so the deterministic reward pipeline (rigor, feasibility, fidelity, total reward formula, breakdown builder) is fully wired
16
+ - `TST 01` to `TST 05` are now complete, with 36 env tests and 26 reward tests passing
17
+ - The next Kian-lane tasks are `MOD 06`, `SCN 13`, `JDG 06`, `JDG 08`, `ENV 10`
18
 
19
  ---
20
 
 
22
 
23
  - [ ] **MOD 06** | Add semantic validators for impossible plans such as zero sample size with positive controls | 0.75h | Depends: MOD 05
24
  - [ ] **SCN 13** | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time-slot conflicts and duration | 1h | Depends: SCN 07
25
+ - [ ] **JDG 06** | Add optional plain English explanation function from reward breakdown | 0.75h | Depends: JDG 05
26
+ - [ ] **JDG 08** | Add score determinism tests and edge case tests | 1h | Depends: JDG 01 to JDG 05
27
+ - [ ] **ENV 10** | Add reset, step, invalid action, timeout, and deterministic replay tests | 1.25h | Depends: ENV 02 to ENV 09
 
28
 
29
  ---
30
 
 
51
  - [x] **SCN 08** | Completed by Person B (Ayush)
52
  - [x] **SCN 09** | Completed by Person B (Ayush)
53
  - [x] **SCN 10** | Completed by Person B (Ayush)
54
+ - [x] **ENV 01** | Completed by Person B (Ayush)
55
+ - [x] **ENV 02** | Completed by Person B (Ayush)
56
+ - [x] **ENV 03** | Completed by Person B (Ayush)
57
+ - [x] **ENV 04** | Completed by Person B (Ayush)
58
+ - [x] **ENV 05** | Completed by Person B (Ayush)
59
+ - [x] **ENV 06** | Completed by Person B (Ayush)
60
+ - [x] **ENV 07** | Completed by Person B (Ayush)
61
+ - [x] **ENV 08** | Completed by Person B (Ayush)
62
+ - [x] **JDG 01** | Completed by Person B (Ayush)
63
+ - [x] **JDG 02** | Completed by Person B (Ayush)
64
+ - [x] **JDG 03** | Completed by Person B (Ayush)
65
+ - [x] **JDG 04** | Completed by Person B (Ayush)
66
+ - [x] **JDG 05** | Completed by Person B (Ayush)
67
+ - [x] **TST 01** | Completed by Person B (Ayush)
68
+ - [x] **TST 02** | Completed by Person B (Ayush)
69
+ - [x] **TST 03** | Completed by Person B (Ayush)
70
+ - [x] **TST 04** | Completed by Person B (Ayush)
71
+ - [x] **TST 05** | Completed by Person B (Ayush)
docs/map/scoring.md CHANGED
@@ -1,19 +1,21 @@
1
  # Scoring Map — `replicalab/scoring/`
2
 
3
  > Judge scoring engine for protocol evaluation.
4
- > Pure deterministic functions — no LLM calls, no side effects.
5
  >
6
- > **Tasks implemented:** JDG 01, JDG 02, JDG 03
7
- > **Tasks remaining:** JDG 04-08
8
 
9
  ## Architecture
10
 
11
  ```
12
  replicalab/scoring/
13
- __init__.py # exports: score_rigor, score_feasibility, score_fidelity
 
14
  rigor.py # JDG 01 — protocol structural quality
15
  feasibility.py # JDG 02 — resource feasibility (wraps AGT 05)
16
  fidelity.py # JDG 03 — adherence to hidden reference spec
 
17
  ```
18
 
19
  ## Shared Utilities
@@ -142,15 +144,40 @@ This is the key difference from JDG 01's element check.
142
 
143
  ---
144
 
145
- ## Not Yet Implemented
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
146
 
147
- ### `compute_reward(protocol, scenario, ...) -> RewardBreakdown` — JDG 04/05
148
- Combines rigor + feasibility + fidelity with weights.
149
- Applies efficiency bonus (rounds used), communication bonus, and penalties.
 
 
 
 
 
 
 
 
150
 
151
  ### Bonuses & Penalties — JDG 06-08
152
- - `efficiency_bonus`: reward for finishing in fewer rounds
153
- - `communication_bonus`: reward for clear negotiation
154
  - `penalties`: policy violations, hallucinated resources, etc.
155
 
156
  ## Data Consumed
 
1
  # Scoring Map — `replicalab/scoring/`
2
 
3
  > Judge scoring engine for protocol evaluation.
4
+ > Pure deterministic functions — no model calls, no side effects.
5
  >
6
+ > **Tasks implemented:** JDG 01, JDG 02, JDG 03, JDG 04, JDG 05
7
+ > **Tasks remaining:** JDG 06-08
8
 
9
  ## Architecture
10
 
11
  ```
12
  replicalab/scoring/
13
+ __init__.py # exports: score_rigor, score_feasibility, score_fidelity,
14
+ # build_reward_breakdown, compute_total_reward
15
  rigor.py # JDG 01 — protocol structural quality
16
  feasibility.py # JDG 02 — resource feasibility (wraps AGT 05)
17
  fidelity.py # JDG 03 — adherence to hidden reference spec
18
+ rubric.py # JDG 04-05 — total reward formula and breakdown builder
19
  ```
20
 
21
  ## Shared Utilities
 
144
 
145
  ---
146
 
147
+ ---
148
+
149
+ ## JDG 04 — `compute_total_reward(breakdown) -> float`
150
+
151
+ **File:** `rubric.py`
152
+ **Formula:** `10 × rigor × feasibility × fidelity + efficiency_bonus + communication_bonus − sum(penalties)`
153
+
154
+ Returns a scalar reward from a `RewardBreakdown` object.
155
+
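A minimal sketch of the formula above, with a plain dict standing in for `RewardBreakdown` and the zero-floor clamp covered by TST 05 applied at the end (the dict field names and the name-to-value `penalties` mapping are assumptions matching this map, not the exact Pydantic model):

```python
def compute_total_reward(b: dict) -> float:
    """Sketch of the JDG 04 formula; penalties is a name -> value mapping."""
    raw = (10.0 * b["rigor"] * b["feasibility"] * b["fidelity"]
           + b["efficiency_bonus"]
           + b["communication_bonus"]
           - sum(b["penalties"].values()))
    return max(0.0, raw)  # zero-floor clamp: reward never goes negative

breakdown = {"rigor": 1.0, "feasibility": 0.5, "fidelity": 1.0,
             "efficiency_bonus": 0.5, "communication_bonus": 0.0,
             "penalties": {}}
# 10 * 1.0 * 0.5 * 1.0 + 0.5 = 5.5
```

The multiplicative core means any zero dimension collapses the base reward, leaving only bonuses minus penalties before the clamp.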
156
+ ## JDG 05 — `build_reward_breakdown(protocol, scenario, rounds_used, max_rounds, *, check=None) -> RewardBreakdown`
157
+
158
+ **File:** `rubric.py`
159
+ **Composes:** rigor (JDG 01) + feasibility (JDG 02) + fidelity (JDG 03) + efficiency bonus.
160
+
161
+ ### Efficiency Bonus
162
+ - Max bonus: 1.0 (configurable via `_MAX_EFFICIENCY_BONUS`)
163
+ - Formula: `max_bonus × (max_rounds - rounds_used) / (max_rounds - 1)`
164
+ - Finishing in round 1 of 6 → maximum bonus; using all rounds → 0
165
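The bullet formula is a simple linear ramp; a runnable sketch follows. The real helper is the internal `_efficiency_bonus(rounds_used, max_rounds)`; the explicit `max_bonus` parameter and the `max_rounds <= 1` division guard here are assumptions:

```python
def efficiency_bonus(rounds_used: int, max_rounds: int, max_bonus: float = 1.0) -> float:
    """Linear bonus: full at round 1, zero when every round is used."""
    if max_rounds <= 1:
        return 0.0  # guard against division by zero (assumed behavior)
    return max_bonus * (max_rounds - rounds_used) / (max_rounds - 1)
```

Finishing round 1 of 6 yields 1.0; using all 6 rounds yields 0.0.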
 
166
+ ### Internal Functions
167
+
168
+ | Function | Purpose |
169
+ |----------|---------|
170
+ | `compute_total_reward(breakdown)` | Apply the reward formula |
171
+ | `build_reward_breakdown(...)` | Compose all sub-scores into a breakdown |
172
+ | `_efficiency_bonus(rounds_used, max_rounds)` | Compute efficiency bonus |
173
+
174
+ ---
175
+
176
+ ## Not Yet Implemented
177
 
178
  ### Bonuses & Penalties — JDG 06-08
179
+ - `explanation_function`: optional plain English from reward breakdown (JDG 06)
180
+ - `communication_bonus`: reward for clear negotiation (reserved)
181
  - `penalties`: policy violations, hallucinated resources, etc.
182
 
183
  ## Data Consumed
docs/map/server.md CHANGED
@@ -1,128 +1,80 @@
1
# Server Map — `server/app.py`
2
 
3
- > FastAPI backend with REST + WebSocket endpoints and stub environment.
 
 
4
  >
5
- > **Tasks implemented:** API 01-04, 06 (partial)
6
 
7
- ## Environment
 
 
 
 
8
 
9
  ### `_StubEnv`
10
- Minimal environment stub used until the real `ReplicaLabEnv` is implemented (ENV 01-11).
11
-
12
- **State:**
13
- | Attribute | Type | Purpose |
14
- |-----------|------|---------|
15
- | `_state` | `EpisodeState` | Full episode state |
16
- | `_episode_id` | `str` | UUID for this episode |
17
- | `_scenario_pack` | `NormalizedScenarioPack \| None` | Stored for lab manager pipeline |
18
- | `_logs` | `list[ConversationEntry]` | Conversation transcript |
19
-
20
- **Methods:**
21
-
22
- | Method | Returns | Behavior |
23
- |--------|---------|----------|
24
- | `reset(seed, scenario, difficulty)` | `Observation` | Generates scenario, builds initial observations |
25
- | `step(action: ScientistAction)` | `StepResult` | Processes scientist action, runs lab manager pipeline |
26
- | `state()` | `EpisodeState` | Returns current state snapshot |
27
- | `episode_id()` | `str` | Returns episode UUID |
28
- | `close()` | `None` | No-op |
29
-
30
- **Lab Manager Integration (AGT 07):**
31
- The `_lab_manager_action()` method runs the full deterministic pipeline:
32
- 1. `check_feasibility(protocol, scenario_pack)` → `FeasibilityCheckResult`
33
- 2. `suggest_alternative(protocol, check_result, scenario_pack)` → `AlternativeSuggestion | None`
34
- 3. `compose_lab_manager_response(check_result, suggestion)` → `LabManagerAction`
35
-
36
- **Termination logic:**
37
- - Episode ends (`done=True`) when `agreement_reached=True` (both agents accept)
38
- - `agreement_reached` when lab manager action_type is `accept` (2-round stub logic)
39
- - On termination: reward = `STUB_ACCEPT_REWARD` (5.0)
40
-
41
- ### `_make_env() -> _StubEnv`
42
- Factory that tries to import `ReplicaLabEnv` from `replicalab.env`, falls back to `_StubEnv`.
43
-
44
- ## REST Endpoints
45
 
46
  ### `GET /health`
47
- Returns `{"status": "ok"}`.
 
48
 
49
  ### `POST /reset`
50
- **Request:** `ResetRequest`
51
- | Field | Type | Default |
52
- |-------|------|---------|
53
- | `seed` | `int \| None` | `None` (random) |
54
- | `scenario` | `str` | `DEFAULT_SCENARIO_TEMPLATE` |
55
- | `difficulty` | `str` | `DEFAULT_DIFFICULTY` |
56
- | `session_id` | `str \| None` | `None` (auto-generated) |
57
-
58
- **Response:** `ResetResponse`
59
- | Field | Type |
60
- |-------|------|
61
- | `session_id` | `str` |
62
- | `episode_id` | `str` |
63
- | `observation` | `Observation` |
64
 
65
- ### `POST /step`
66
- **Request:** `StepRequest`
67
- | Field | Type |
68
- |-------|------|
69
- | `session_id` | `str` |
70
- | `action` | `ScientistAction` |
71
 
72
- **Response:** `StepResult` (observation, reward, done, info)
 
73
 
74
- When `done=True`, the episode log is stored in `_replay_store`.
 
 
75
 
76
  ### `GET /scenarios`
77
- Returns `available_scenario_families()` list of families with difficulties.
78
 
79
  ### `GET /replay/{episode_id}`
80
- Returns `EpisodeLog` for a completed episode, or 404 if not found.
81
 
82
- ## WebSocket Endpoint
83
 
84
  ### `WS /ws`
85
- Bidirectional session with JSON messages.
86
-
87
- **Client → Server messages:**
88
- | Type | Payload | Behavior |
89
- |------|---------|----------|
90
- | `reset` | `{seed, scenario, difficulty}` | Creates env, returns initial state |
91
- | `step` | `{action: ScientistAction}` | Steps env, returns result |
92
- | `ping` | — | Returns `{"type": "pong"}` |
93
 
94
- **Server → Client messages:**
95
- | Type | Payload |
96
- |------|---------|
97
- | `state` | `{observation, episode_id}` |
98
- | `step_result` | `StepResult.info.model_dump()` |
99
- | `pong` | `{}` |
100
- | `error` | `{message}` |
101
 
102
- ## Session Management
103
 
104
- | Store | Type | Purpose |
105
- |-------|------|---------|
106
- | `_sessions` | `dict[str, dict]` | Active REST sessions (env + last_active) |
107
- | `_replay_store` | `dict[str, EpisodeLog]` | Completed episode logs |
108
 
109
- **Cleanup:** Background task runs every 60s, removes sessions older than `SESSION_TTL_SECONDS` (300s).
 
 
 
110
 
111
- ## Helper Functions
112
 
113
  | Function | Purpose |
114
- |----------|---------|
115
- | `_reward_breakdown_from_state(state)` | Extract RewardBreakdown from EpisodeState scores |
116
- | `_build_episode_log(episode_id, state)` | Build EpisodeLog from final state |
117
- | `_touch(session_id)` | Update last_active timestamp |
118
- | `_cleanup_stale_sessions()` | Remove expired sessions |
119
-
120
- ## Dependencies
121
-
122
- ```python
123
- from replicalab.agents import check_feasibility, compose_lab_manager_response, suggest_alternative
124
- from replicalab.config import API_HOST, API_PORT, DEFAULT_DIFFICULTY, ...
125
- from replicalab.models import (ConversationEntry, EpisodeLog, EpisodeState, LabManagerAction,
126
- Observation, Protocol, RewardBreakdown, ScientistAction, StepInfo, StepResult, ...)
127
- from replicalab.scenarios import NormalizedScenarioPack, available_scenario_families, generate_scenario
128
- ```
 
+ # Server Map — `server/app.py`
 
+ > FastAPI backend with REST + WebSocket endpoints. The normal path now uses
+ > the real `ReplicaLabEnv`; `_StubEnv` remains only as a fallback if the env
+ > package cannot be imported.
  >
+ > **Tasks implemented:** API 01-09, 13, 15
 
+ ## Environment path
+
+ ### `ReplicaLabEnv`
+ Primary environment implementation imported from
+ `replicalab.env.replicalab_env`.
 
  ### `_StubEnv`
+ Legacy fallback kept so the server can still boot if the real env import
+ fails. It is no longer the intended local or Docker runtime.
+
+ ### `_make_env()`
+ Factory that prefers `ReplicaLabEnv` and falls back to `_StubEnv` only on
+ import failure.
+
+ ## REST endpoints
 
  ### `GET /health`
+ Returns a liveness payload. When the real env path is active, the response
+ includes `env: "real"`.
 
  ### `POST /reset`
+ Starts a new episode and returns:
 
+ - `session_id`
+ - `episode_id`
+ - a typed `Observation`
 
+ ### `POST /step`
+ Submits a typed `ScientistAction` and returns a `StepResult`.
 
+ When `done=true`, the terminal `StepResult` is also used to build the replay
+ log so `reward_breakdown`, `judge_notes`, and `verdict` stay aligned with the
+ real env result.
 
  ### `GET /scenarios`
+ Returns the available scenario families and supported difficulties.
 
  ### `GET /replay/{episode_id}`
+ Returns the stored `EpisodeLog` for a completed episode, or 404 if not found.
 
+ ## WebSocket endpoint
 
  ### `WS /ws`
+ Per-connection isolated environment session supporting:
 
+ - `reset`
+ - `step`
+ - `ping`
 
+ Idle timeout and disconnect cleanup are implemented and verified.
 
+ ## Session management
 
+ | Store | Purpose |
+ | --- | --- |
+ | `_sessions` | Active REST sessions |
+ | `_replay_store` | Completed episode logs |
 
+ ## Key helpers
 
  | Function | Purpose |
+ | --- | --- |
+ | `_build_episode_log(episode_id, state, result)` | Build replay log from final state and terminal step result |
+ | `_touch(session_id)` | Refresh REST session last-active timestamp |
+ | `_cleanup_stale_sessions()` | Remove expired REST sessions |
+
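The TTL-based cleanup behind `_touch` and `_cleanup_stale_sessions` can be sketched as below. The 300 s TTL matches the `SESSION_TTL_SECONDS` documented elsewhere in this repo; the exact shape of the session dict and the `now` parameter are assumptions for illustration.

```python
import time

SESSION_TTL_SECONDS = 300  # documented REST-session TTL

# session_id -> {"env": <env instance>, "last_active": <epoch seconds>}
# (shape assumed; mirrors the _sessions store described above)
_sessions: dict[str, dict] = {}

def _touch(session_id: str) -> None:
    # Refresh the last-active timestamp on every request for this session.
    _sessions[session_id]["last_active"] = time.time()

def _cleanup_stale_sessions(now: float | None = None) -> None:
    # Collect first, then pop, so we never mutate the dict while iterating.
    now = time.time() if now is None else now
    stale = [sid for sid, s in _sessions.items()
             if now - s["last_active"] > SESSION_TTL_SECONDS]
    for sid in stale:
        _sessions.pop(sid)  # the real server would also close the session's env here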
+ ## Current deployment state
+
+ - Local OpenEnv validation passes
+ - Local Docker build and run verification passes
+ - HF Spaces metadata is present in the root `README.md` and root `Dockerfile`
+ - Live hosted verification remains `API 10`
 
docs/map/tests.md CHANGED
@@ -1,8 +1,8 @@
  # Tests Map — `tests/`
 
- > 134 tests across 8 files. All passing.
  >
- > **Last verified:** 2026-03-07
 
  ## Summary
 
@@ -14,15 +14,17 @@
  | `test_validation.py` | 13 | Protocol validation checks |
  | `test_scientist_policy.py` | 18+ | Parser, retry, formatter, baseline, bounded tools |
  | `test_lab_manager_policy.py` | 13 | Feasibility, suggestion, response |
- | `test_reward.py` | 18 | JDG 01-03 scoring functions |
- | `test_server.py` | 5 | API endpoint integration |
- | **Total** | **134** | |
 
  ## Missing Coverage (not yet implemented)
 
  | File (planned) | Would cover |
  |---------------|-------------|
- | `test_env.py` | ENV 01-11 real environment |
 
  ---
 
@@ -182,6 +184,185 @@
  | `test_compose_lab_manager_response_reports_non_lab_issues` | Policy-only → REPORT |
  | `test_compose_lab_manager_response_uses_custom_renderer_without_changing_verdict` | Custom renderer works |
 
  ## Test Helpers
 
  ### Shared fixtures in test files
@@ -192,3 +373,13 @@
  | `_base_observation(**overrides)` | test_scientist_policy | Build ScientistObservation with defaults |
  | `_make_system_prompt()` | test_scientist_policy | Build prompt from math_reasoning scenario |
  | `_VALID_REQUEST_INFO_JSON` | test_scientist_policy | Valid request_info JSON string |
 
  # Tests Map — `tests/`
 
+ > 231 tests across 10 files. All passing.
  >
+ > **Last verified:** 2026-03-08
 
  ## Summary
 
  | `test_validation.py` | 13 | Protocol validation checks |
  | `test_scientist_policy.py` | 18+ | Parser, retry, formatter, baseline, bounded tools |
  | `test_lab_manager_policy.py` | 13 | Feasibility, suggestion, response |
+ | `test_reward.py` | 26 | JDG 01-05 scoring functions |
+ | `test_env.py` | 36 | ENV 01-08, JDG 04-05, TST 01-03 |
+ | `test_server.py` | 34 | API endpoint integration (API 02-04, 06-07, 13) |
+ | `test_client.py` | 24 | TRN 13 client module (REST + WS transports) |
+ | **Total** | **231** | |
 
  ## Missing Coverage (not yet implemented)
 
  | File (planned) | Would cover |
  |---------------|-------------|
+ | `test_env.py` (expand) | ENV 10 full reset/step/replay tests |
 
  ---
 
  | `test_compose_lab_manager_response_reports_non_lab_issues` | Policy-only → REPORT |
  | `test_compose_lab_manager_response_uses_custom_renderer_without_changing_verdict` | Custom renderer works |
 
+ ## `test_reward.py` (18 tests)
+
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_rigor_good_protocol_scores_higher_than_bad` | Quality ordering |
+ | `test_rigor_is_deterministic` | Same inputs → same output |
+ | `test_rigor_empty_controls_reduces_score` | Controls matter |
+ | `test_rigor_short_rationale_reduces_score` | Rationale length matters |
+ | `test_rigor_all_domains_return_valid_range` | [0,1] across all 9 combinations |
+ | `test_feasibility_viable_protocol_scores_high` | Good protocol > 0.7 |
+ | `test_feasibility_infeasible_protocol_scores_lower` | Bad < good |
+ | `test_feasibility_accepts_precomputed_check` | Pre-computed = computed |
+ | `test_feasibility_is_deterministic` | Same inputs → same output |
+ | `test_feasibility_partial_credit_for_near_budget` | Slightly over > far over |
+ | `test_feasibility_all_domains_return_valid_range` | [0,1] across all 9 combinations |
+ | `test_fidelity_aligned_protocol_scores_higher` | Aligned > misaligned |
+ | `test_fidelity_is_deterministic` | Same inputs → same output |
+ | `test_fidelity_substitution_gets_partial_credit` | Sub > miss |
+ | `test_fidelity_mentioning_target_metric_improves_score` | Metric mention helps |
+ | `test_fidelity_all_domains_return_valid_range` | [0,1] across all 9 combinations |
+ | `test_all_scores_between_zero_and_one_for_bad_protocol` | Bounds check |
+ | `test_good_protocol_dominates_bad_on_rigor_and_fidelity` | Cross-scorer consistency |
+
+ ## `test_env.py` (32 tests)
+
+ ### TST 01 — Reset (8 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_reset_returns_observation_with_both_roles` | Both scientist + lab_manager present |
+ | `test_reset_scientist_fields_populated` | Paper title, hypothesis, goal, round 0 |
+ | `test_reset_lab_manager_fields_populated` | Budget, staff, time limit populated |
+ | `test_reset_preserves_booked_and_out_of_stock` | ENV 02 scenario-pack data preserved |
+ | `test_reset_state_round_zero` | State starts at round 0, not done |
+ | `test_reset_generates_episode_id` | UUID episode ID generated |
+ | `test_reset_clears_previous_episode` | Second reset clears first episode |
+ | `test_reset_all_templates_and_difficulties` | All 9 template/difficulty combos work |
+
+ ### TST 03 — Invalid Action (4 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_invalid_duration_returns_error_string` | Validation error returned |
+ | `test_env_survives_after_invalid_action` | Env still accepts valid actions after error |
+ | `test_invalid_action_does_not_advance_round` | Round stays at 0 |
+ | `test_request_info_always_passes_validation` | Non-proposal actions skip validation |
+
+ ### TST 02 — Step and Terminal Path (8 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_step_advances_round_number` | Round increments |
+ | `test_step_returns_observations` | Both roles in step result |
+ | `test_step_records_conversation_history` | Scientist + LM entries logged |
+ | `test_accept_with_protocol_terminates` | Accept → done=True |
+ | `test_accept_terminal_step_has_real_reward` | ENV 06 real scores, not stub 0.8 |
+ | `test_max_rounds_terminates` | Max rounds → done, no agreement |
+ | `test_step_info_has_round_and_episode_id` | Metadata populated |
+ | `test_full_episode_propose_then_accept` | Full 2-step episode |
+
+ ### ENV 07 — State Snapshot (2 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_state_is_deep_copy` | Mutating snapshot doesn't affect env |
+ | `test_state_history_is_independent` | History list is independent copy |
+
+ ### ENV 08 — Close/Reopen (3 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_close_is_idempotent` | Double close doesn't throw |
+ | `test_step_after_close_raises` | RuntimeError on step after close |
+ | `test_reset_reopens_closed_env` | Reset clears closed state |
+
+ ### JDG 04-05 — Rubric (7 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_compute_total_reward_formula` | 10×r×f×fi + bonuses = expected |
+ | `test_compute_total_reward_with_penalties` | Penalties subtracted correctly |
+ | `test_compute_total_reward_zero_scores` | Zero dimension → zero reward |
+ | `test_build_reward_breakdown_returns_valid_scores` | All sub-scores in [0,1] |
+ | `test_build_reward_breakdown_efficiency_bonus` | Fewer rounds → higher bonus |
+ | `test_build_reward_breakdown_is_deterministic` | Same inputs → same output |
+ | `test_total_reward_matches_manual_calculation` | Cross-check formula |
+
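The formula these rubric tests cross-check (`10×r×f×fi` plus bonuses, minus penalties) can be sketched as follows. The flat-argument signature is illustrative only; per the function table in the reward map, the real `compute_total_reward` takes a `RewardBreakdown` object.

```python
# Sketch of the total-reward formula exercised by the JDG 04-05 tests:
# 10 × rigor × feasibility × fidelity, plus bonuses, minus penalties.
# Flat arguments are illustrative; the real function takes a RewardBreakdown.
def compute_total_reward(rigor: float, feasibility: float, fidelity: float,
                         efficiency_bonus: float = 0.0,
                         penalties: float = 0.0) -> float:
    return 10.0 * rigor * feasibility * fidelity + efficiency_bonus - penalties
```

Because the three dimensions multiply, any single zero score zeroes the product term, which is exactly what `test_compute_total_reward_zero_scores` asserts.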
+ ## `test_server.py` (34 tests)
+
+ ### GET /scenarios — API 04 (5 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_returns_200` | Endpoint returns 200 |
+ | `test_response_has_scenarios_key` | Response has `scenarios` list |
+ | `test_all_families_present` | All 3 families present |
+ | `test_each_family_has_difficulties` | Each has easy/medium/hard |
+ | `test_no_extra_keys` | Only `family` and `difficulties` keys |
+
+ ### CORS — API 13 (3 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_preflight_allows_localhost_vite_origin` | localhost:5173 allowed |
+ | `test_preflight_allows_hf_space_origin` | HF Spaces origin allowed |
+ | `test_preflight_rejects_unconfigured_origin` | Unknown origin → 400 |
+
+ ### POST /reset — API 02 (7 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_reset_returns_200_with_expected_keys` | 200 with session_id, episode_id, observation |
+ | `test_reset_observation_has_both_roles` | Scientist + lab_manager present |
+ | `test_reset_with_explicit_session_id_reuses_slot` | Same session_id reused |
+ | `test_reset_reuse_closes_prior_env` | New episode on reuse |
+ | `test_reset_default_params` | Defaults work without error |
+ | `test_reset_custom_scenario_and_difficulty` | All 9 combos succeed |
+ | `test_reset_deterministic_with_same_seed` | Same seed → same observation |
+
+ ### POST /step — API 03 (5 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_reset_then_step_happy_path` | Reset → step returns 200 with StepResult |
+ | `test_step_invalid_session_returns_404` | Non-existent session → 404 |
+ | `test_terminal_step_returns_real_reward_breakdown` | Accept has real scores, not stub 0.8 |
+ | `test_semantic_invalid_action_returns_200_with_error` | Invalid duration → 200 with info.error |
+ | `test_replay_uses_real_judge_data` | Replay has real judge_notes, not stub |
+
+ ### WebSocket — API 06 (12 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_ws_ping_pong` | Ping → pong |
+ | `test_ws_reset_returns_observation` | Reset returns episode_id + observation |
+ | `test_ws_step_returns_result` | Step returns step_ok with result |
+ | `test_ws_full_episode_real_reward` | Propose → accept returns real scores |
+ | `test_ws_invalid_json` | Bad JSON → error |
+ | `test_ws_missing_action_field` | Missing action → error |
+ | `test_ws_invalid_action_payload` | Invalid action schema → error |
+ | `test_ws_unknown_message_type` | Unknown type → error |
+ | `test_ws_session_isolation` | Two connections have independent env state |
+ | `test_ws_semantic_invalid_action_returns_step_ok_with_info_error` | Invalid duration → step_ok with info.error |
+ | `test_ws_timeout_verdict` | Max rounds → done, timeout verdict |
+ | `test_ws_terminal_episode_persists_real_replay_log` | WS episode → /replay has real data |
+
+ ### WebSocket Idle Timeout — API 07 (2 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_ws_idle_timeout_closes_connection` | No messages → server closes with code 1000 |
+ | `test_ws_env_closes_on_disconnect` | env.close() called in finally block on disconnect |
+
+ ## `test_client.py` (24 tests)
+
+ ### REST Transport (10 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_connect_succeeds` | REST connect hits /health |
+ | `test_connect_bad_url_raises` | Bad URL raises |
+ | `test_reset_returns_observation` | reset() returns typed Observation |
+ | `test_reset_sets_session_and_episode_id` | IDs set after reset |
+ | `test_reset_reuses_session` | Same session_id on re-reset |
+ | `test_step_returns_step_result` | step() returns typed StepResult |
+ | `test_step_before_reset_raises` | step() without reset raises |
+ | `test_full_episode_propose_accept` | Full episode with reward > 0 |
+ | `test_replay_after_episode` | replay() returns typed EpisodeLog |
+ | `test_context_manager_closes` | `with` block sets connected=False |
+
+ ### WebSocket Transport (11 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_connect_succeeds` | WS connect opens connection |
+ | `test_connect_bad_url_raises` | Bad URL raises |
+ | `test_reset_returns_observation` | reset() returns typed Observation |
+ | `test_reset_sets_episode_id` | episode_id set after reset |
+ | `test_ws_session_id_is_none` | WS has no session_id |
+ | `test_step_returns_step_result` | step() returns typed StepResult |
+ | `test_full_episode_propose_accept` | Full episode with reward > 0 |
+ | `test_semantic_invalid_action_step_ok_with_error` | Invalid action → info.error |
+ | `test_context_manager_closes` | `with` block sets connected=False |
+ | `test_state_not_supported` | state() raises NotImplementedError |
+ | `test_replay_not_supported` | replay() raises NotImplementedError |
+
+ ### Constructor (3 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_unknown_transport_raises` | "grpc" → ValueError |
+ | `test_not_connected_raises_on_reset` | reset() without connect raises |
+ | `test_default_transport_is_websocket` | Default is _WsTransport |
+
  ## Test Helpers
 
  ### Shared fixtures in test files
 
  | `_base_observation(**overrides)` | test_scientist_policy | Build ScientistObservation with defaults |
  | `_make_system_prompt()` | test_scientist_policy | Build prompt from math_reasoning scenario |
  | `_VALID_REQUEST_INFO_JSON` | test_scientist_policy | Valid request_info JSON string |
+ | `_scenario(template, difficulty)` | test_env | Generate scenario with seed=42 |
+ | `_good_action(scenario)` | test_env | Build valid propose_protocol action |
+ | `_accept_action()` | test_env | Build valid accept action |
+ | `_good_protocol(scenario)` | test_env | Build well-formed protocol |
+ | `_reset(client, **kwargs)` | test_server | Reset and return response JSON |
+ | `_good_action_payload(client)` | test_server | Build valid propose_protocol payload |
+ | `_accept_action_payload()` | test_server | Build valid accept payload |
+ | `_propose_action(obs)` | test_client | Build valid propose_protocol ScientistAction |
+ | `_accept_action()` | test_client | Build valid accept ScientistAction |
+ | `live_server` | test_client | Module-scoped uvicorn server fixture |
@@ -28,7 +28,7 @@ curl http://localhost:7860/health
28
 
29
  curl -X POST http://localhost:7860/reset \
30
  -H "Content-Type: application/json" \
31
- -d '{"seed": 42, "scenario": "cell_biology", "difficulty": "easy"}'
32
  ```
33
 
34
  ---
@@ -40,6 +40,30 @@ docker build -f server/Dockerfile -t replicalab .
40
  docker run -p 7860:7860 replicalab
41
  ```
42
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
43
  With optional hosted-model secrets:
44
 
45
  ```bash
@@ -50,27 +74,138 @@ docker run -p 7860:7860 \
50
 
51
  ---
52
 
53
- ## Hosted Space Deployment
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
54
 
55
- The repository is not yet marked as fully deployed. Use this section as the deployment checklist for the later API 09, API 10, API 15, and API 17 tasks.
 
 
 
 
 
 
 
 
 
 
 
56
 
57
- ### One-time setup
 
58
 
59
- 1. Create a Space with Docker support.
60
- 2. Add the Space as a remote.
61
- 3. Push the repository once the Docker path and README metadata are finalized.
62
- 4. Verify `/health`, `/reset`, and `/ws` after the Space build finishes.
63
 
64
- ### Secrets checklist
 
 
 
 
 
 
 
 
 
 
 
 
 
65
 
66
- If the deployed server needs hosted-model credentials later, set them in the platform secret store rather than committing them to the repo.
 
 
67
 
68
- Suggested secret names:
69
 
70
- | Secret name | Purpose |
71
- |-------------|---------|
72
- | `MODEL_API_KEY` | Hosted model access key |
73
- | `MODEL_BASE_URL` | Optional alternate provider endpoint |
 
 
 
74
 
75
  ---
76
 
@@ -105,7 +240,7 @@ When hosted deployment is eventually verified:
105
 
106
  | Issue | Fix |
107
  |-------|-----|
108
- | `ReplicaLabEnv not found` warning at startup | Normal while the real env implementation has not landed; the server will use the stub env |
109
  | Docker build fails | Re-check `server/requirements.txt` and the Docker build context |
110
  | CORS error from the frontend | Re-check allowed origins in `server/app.py` |
111
  | WebSocket closes after idle time | Send periodic ping messages or reconnect |
 
 
  curl -X POST http://localhost:7860/reset \
    -H "Content-Type: application/json" \
+   -d '{"seed": 42, "scenario": "math_reasoning", "difficulty": "easy"}'
  ```
 
  ---
 
  docker run -p 7860:7860 replicalab
  ```
 
+ ### Verified endpoints (API 08 sign-off, 2026-03-08)
+
+ After `docker run -p 7860:7860 replicalab`, the following were verified
+ against the **real env** (not the stub):
+
+ ```bash
+ curl http://localhost:7860/health
+ # → {"status":"ok","env":"real"}
+
+ curl http://localhost:7860/scenarios
+ # → {"scenarios":[{"family":"math_reasoning",...}, ...]}
+
+ curl -X POST http://localhost:7860/reset \
+   -H "Content-Type: application/json" \
+   -d '{"seed":42,"scenario":"math_reasoning","difficulty":"easy"}'
+ # → {"session_id":"...","episode_id":"...","observation":{...}}
+
+ # Use session_id from the reset response:
+ curl -X POST http://localhost:7860/step \
+   -H "Content-Type: application/json" \
+   -d '{"session_id":"<SESSION_ID>","action":{"action_type":"propose_protocol","sample_size":3,"controls":["baseline"],"technique":"algebraic_proof","duration_days":1,"required_equipment":[],"required_reagents":[],"questions":[],"rationale":"Test."}}'
+ # → {"observation":{...},"reward":0.0,"done":false,"info":{...}}
+ ```
+
  With optional hosted-model secrets:
 
  ```bash
 
  ---
 
+ ## Hugging Face Spaces Deployment
+
+ ### What is already configured (API 09)
+
+ The repo is now deployment-ready for HF Spaces:
+
+ - **Root `Dockerfile`** — HF Spaces requires the Dockerfile at the repo root.
+   The root-level `Dockerfile` is identical to `server/Dockerfile`. Keep them
+   in sync, or delete `server/Dockerfile` once the team standardizes.
+ - **`README.md` frontmatter** — The root README now contains the required
+   YAML frontmatter that HF Spaces parses on push:
+   ```yaml
+   ---
+   title: ReplicaLab
+   emoji: 🧪
+   colorFrom: blue
+   colorTo: green
+   sdk: docker
+   app_port: 7860
+   pinned: false
+   ---
+   ```
+ - **Non-root user** — The Dockerfile creates and runs as `appuser` (UID 1000),
+   which HF Spaces requires for security.
+ - **Port 7860** — Both the `EXPOSE` directive and the `uvicorn` CMD use 7860,
+   matching the `app_port` in the frontmatter.
+
+ ### Step-by-step deployment (for Max)
+
+ #### 1. Create the Space
+
+ 1. Go to https://huggingface.co/new-space
+ 2. Fill in:
+    - **Owner:** your HF username or the team org
+    - **Space name:** `replicalab` (or `replicalab-demo`)
+    - **License:** MIT
+    - **SDK:** Docker
+    - **Hardware:** CPU Basic (the free tier is fine for the server)
+    - **Visibility:** Public
+ 3. Click **Create Space**
+
+ #### 2. Add the Space as a git remote
+
+ ```bash
+ # From the repo root
+ git remote add hf https://huggingface.co/spaces/<YOUR_HF_USERNAME>/replicalab
+
+ # If the org is different:
+ # git remote add hf https://huggingface.co/spaces/<ORG>/replicalab
+ ```
+
+ #### 3. Push the repo
+
+ ```bash
+ # Push the current branch to the Space
+ git push hf ayush:main
+
+ # Or if deploying from master:
+ # git push hf master:main
+ ```
+
+ HF Spaces will automatically detect the `Dockerfile`, build the image, and
+ start the container.
+
+ #### 4. Monitor the build
+
+ 1. Go to https://huggingface.co/spaces/\<YOUR_HF_USERNAME\>/replicalab
+ 2. Click the **Logs** tab (or the **Build** tab during the first deploy)
+ 3. Wait for the build to complete (typically 2-5 minutes)
+ 4. The Space status should change from "Building" to "Running"
+
+ #### 5. Verify the deployment (API 10 scope)
+
+ Once the Space is running:
+
+ ```bash
+ # Health check
+ curl https://<space-name>.hf.space/health
+
+ # Reset an episode
+ curl -X POST https://<space-name>.hf.space/reset \
+   -H "Content-Type: application/json" \
+   -d '{"seed": 42, "scenario": "math_reasoning", "difficulty": "easy"}'
 
+ # List scenarios
+ curl https://<space-name>.hf.space/scenarios
+ ```
+
+ WebSocket test (using websocat or wscat):
+ ```bash
+ wscat -c wss://<space-name>.hf.space/ws
+ # Then type: {"type": "ping"}
+ # Expect: {"type": "pong"}
+ ```
+
+ ### Secrets configuration
 
+ If the deployed server needs hosted-model credentials later (e.g. for a
+ frontier evaluator), set them in the HF Space secret store:
 
+ 1. Go to the Space **Settings** tab
+ 2. Scroll to **Repository secrets**
+ 3. Add each secret:
 
+ | Secret name | Purpose | Required now? |
+ |-------------|---------|---------------|
+ | `MODEL_API_KEY` | Hosted model access key (for frontier evaluator) | No — only for the demo-time evaluator |
+ | `MODEL_BASE_URL` | Optional alternate provider endpoint | No |
+
+ Secrets are injected as environment variables at container runtime.
+ Access them in Python with `os.environ.get("MODEL_API_KEY")`.
+
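A minimal sketch of reading these optional secrets at startup. The secret names come from the table above; the empty-string default for `MODEL_BASE_URL` and the `hosted_model_enabled` helper are illustrative assumptions, not the server's actual config code.

```python
import os

# Optional hosted-model secrets, injected by the Space at container runtime.
# The "" default for MODEL_BASE_URL is illustrative, not the real default.
MODEL_API_KEY = os.environ.get("MODEL_API_KEY")        # None if unset
MODEL_BASE_URL = os.environ.get("MODEL_BASE_URL", "")  # optional override

def hosted_model_enabled() -> bool:
    # The hosted evaluator path should only activate when a key is present.
    return MODEL_API_KEY is not None
```

Reading with `os.environ.get` (rather than `os.environ[...]`) keeps the server bootable when the secrets are absent, matching the "Required now? No" column above.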
+ ### Re-deploying after code changes
+
+ ```bash
+ # Just push again — HF rebuilds automatically
+ git push hf ayush:main
+ ```
+
+ To force a full rebuild (e.g. after dependency changes):
+
+ 1. Go to Space **Settings**
+ 2. Click **Factory reboot** under the Danger zone section
+
+ ### Known limitations
+
+ - **Free CPU tier** — 2 vCPU and 16 GB RAM. This is sufficient for the
+   FastAPI server but NOT for running RL training. Training happens in Colab.
+ - **Cold starts** — Free-tier Spaces sleep after 48 hours of inactivity.
+   The first request after sleep takes 30-60 seconds while the Space restarts.
+ - **Persistent storage** — Episode replays and logs are in-memory only.
+   They reset when the container restarts. This is acceptable for the
+   hackathon demo.
 
  ---
 
  | Issue | Fix |
  |-------|-----|
+ | `ReplicaLabEnv not found` warning at startup | The real env is now available; ensure `replicalab/scoring/rubric.py` is present and `httpx` + `websocket-client` are in `server/requirements.txt` |
  | Docker build fails | Re-check `server/requirements.txt` and the Docker build context |
  | CORS error from the frontend | Re-check allowed origins in `server/app.py` |
  | WebSocket closes after idle time | Send periodic ping messages or reconnect |
docs/max/task_breakdown.md CHANGED
@@ -10,29 +10,34 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
  - Those tasks were executed by `Person B (Ayush)` and logged in `docs/changes.md`
  - `FND 03` and `FND 12` are now complete via the validated frontend import from Kush's branch onto `ayush`
  - `FND 11` is now complete and verified
- - A normalized backend import from Max's PR is on `ayush`: `server/app.py`, `server/Dockerfile`, and `docs/max/deployment.md`
- - That backend import is intentionally tracked as partial because it still runs on the stub env and Docker has not yet been validated locally
- - Max's remaining implementation priority is the real-env-backed API and deployment path
 
  ---
 
  ## Unblocked now
 
- 1. Convert the stub-backed API tasks to real-env-backed implementations once Kian lands the environment work
- 2. Validate Docker locally once the real env path is in place
 
  ---
 
  ## Still blocked
 
  - `FND 13` is now unblocked because `FND 03` is complete, but it remains owned by Kush (Person D)
- - Real completion of `API 01`, `API 02`, `API 03`, `API 06`, and `API 07` depends on Kian's environment tasks
- - Real completion of `API 08` depends on local Docker build and run validation
 
  ---
 
  ## Recommended execution order
 
- 1. Re-validate the imported server scaffold against Kian's environment implementation
- 2. Validate `server/Dockerfile` locally
- 3. Continue into deployment and replay work once the real env path is stable
 
  - Those tasks were executed by `Person B (Ayush)` and logged in `docs/changes.md`
  - `FND 03` and `FND 12` are now complete via the validated frontend import from Kush's branch onto `ayush`
  - `FND 11` is now complete and verified
+ - The backend path is now real-env-backed locally: `server/app.py` imports `ReplicaLabEnv`, `openenv validate` passes, and local Docker verification is complete
+ - `API 02`, `API 03`, `API 04`, `API 06`, `API 07`, `API 08`, `API 09`, `API 13`, and `API 15` are complete
+ - `API 01`, `API 14`, and `OBS 02` are the remaining partial tasks in Max's lane
+ - Max's remaining implementation priority is live Space deployment, replay persistence, observability, and the remaining API polish
 
  ---
 
  ## Unblocked now
 
+ 1. `API 10` is now unblocked because HF metadata and deployment instructions are in place
+ 2. `API 17` is now unblocked because `API 09` is complete
+ 3. Replay and persistence work (`MOD 07`, `ENV 09`, `API 05`, `JDG 07`) is now the next infra-heavy backend chain
 
  ---
 
  ## Still blocked
 
  - `FND 13` is now unblocked because `FND 03` is complete, but it remains owned by Kush (Person D)
+ - `API 05` still depends on `ENV 09`
+ - `API 16` still depends on `UI 10`
+ - `API 18` still depends on `API 05` and `ENV 11`
+ - `API 19` still depends on `API 10`
 
  ---
 
  ## Recommended execution order
 
+ 1. Ship the `API 10` live Space deployment verification
+ 2. Ship the `API 17` secrets and hosted-key documentation
+ 3. Finish the `API 01`, `API 14`, and `OBS 02` sign-off work
+ 4. Move into replay persistence (`MOD 07` → `ENV 09` → `API 05`)
docs/max/task_list.md CHANGED
@@ -6,22 +6,20 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`

  ## Current status

- - `FND 01`, `FND 02`, `FND 05`, `FND 07`, and `FND 10` are complete
- - All five were executed by `Person B (Ayush)` and recorded as executor deviations
- - `FND 03` is complete via the validated frontend import from Kush's branch onto `ayush`
- - `FND 11` is complete
- - `FND 12` is complete via the imported and validated `frontend/vite.config.ts`
- - A stub-backed backend server scaffold now exists in `server/app.py`
- - `API 01`, `API 02`, `API 03`, `API 04`, `API 06`, `API 07`, `API 08`, `API 13`, `API 14`, and `OBS 02` are partial pending real-env and Docker-level verification
- - The remaining Max work is now the API, Docker, deployment, replay, and observability path
+ - `FND 01`, `FND 02`, `FND 03`, `FND 05`, `FND 07`, `FND 10`, `FND 11`, and `FND 12` are complete
+ - The server now runs against the real `ReplicaLabEnv`, not just the legacy stub fallback
+ - `API 02`, `API 03`, `API 04`, `API 06`, `API 07`, `API 08`, `API 09`, `API 13`, and `API 15` are complete
+ - `API 01`, `API 14`, and `OBS 02` remain partial
+ - The remaining Max work is now live deployment verification, replay or persistence work, observability, and the rest of the API or packaging path

  ---

  ## Immediate next tasks

- - [ ] **API 01 / API 02 / API 03 / API 06** | Convert the stub-backed server scaffold into real-env-backed endpoints | Depends: `ENV 01`, `ENV 02`, `ENV 06` | Status: partial
- - [ ] **API 08** | Validate Docker locally for the server image | Depends: `API 01` to `API 07` | Status: partial
- - [ ] **OBS 02** | Confirm logging behavior against the integrated environment path | Depends: `API 01` | Status: partial
+ - [ ] **API 10** | Deploy the live HF Space and verify `/health`, `/reset`, and `/step` end to end | Depends: `API 09`
+ - [ ] **API 17** | Document secrets and API key management for HF Space and Colab | Depends: `API 09`
+ - [ ] **API 01 / API 14 / OBS 02** | Finish the remaining partial server tasks and sign-offs | Depends: real-env server path already present
+ - [ ] **MOD 07 / ENV 09 / API 05** | Finish replay persistence and replay retrieval path | Depends: `MOD 04`, `ENV 06`

  ---

@@ -38,4 +36,14 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
  ## Completed in Max's lane

  - [x] **FND 11** | Completed and verified in `server/requirements.txt`
+ - [x] **API 02** | Completed by Person B (Ayush)
+ - [x] **API 03** | Completed by Person B (Ayush)
+ - [x] **API 04** | Completed by Person B (Ayush)
+ - [x] **API 06** | Completed by Person B (Ayush)
+ - [x] **API 07** | Completed by Person B (Ayush)
+ - [x] **API 08** | Completed by Person B (Ayush)
+ - [x] **API 09** | Completed by Person B (Ayush)
+ - [x] **API 13** | Completed by Person B (Ayush)
+ - [x] **API 15** | Completed by Person B (Ayush)
+ - [x] **TST 07** | Completed by Person B (Ayush)

replicalab/__init__.py CHANGED
@@ -0,0 +1,3 @@
+ from replicalab.client import ReplicaLabClient
+
+ __all__ = ["ReplicaLabClient"]
replicalab/agents/scientist_policy.py CHANGED
@@ -5,7 +5,7 @@ MOD 09 introduced strict parsing from raw model output into
  builder so prompt assembly can be driven by the normalized scenario pack
  instead of hard-coded domain text. AGT 02 adds the per-turn observation
  formatter that converts a ``ScientistObservation`` into the user message
- sent to the LLM each round. AGT 03 wraps the formatter and parser in a
+ sent to the model each round. AGT 03 wraps the formatter and parser in a
  retry loop with error-specific correction prompts and exposed telemetry.
  AGT 04 adds a deterministic baseline Scientist so smoke tests can run
  without a trained model.
@@ -139,7 +139,7 @@ def call_scientist_with_retry(
      *,
      max_retries: int = 2,
  ) -> ScientistCallResult:
-     """Call an LLM to produce a ``ScientistAction`` with parser-driven retries.
+     """Call a model backend to produce a ``ScientistAction`` with parser-driven retries.

      On parse failure the error is fed back to the model as a correction
      prompt and the model is asked to try again, up to *max_retries* times.
@@ -279,6 +279,18 @@ def build_scientist_system_prompt(
          "For accept, questions must be empty and protocol-edit fields must stay "
          "empty or zero."
      ),
+     (
+         "Bounded tool policy: you have access to three bounded tools. "
+         "search_evidence retrieves supporting facts from frozen evidence packs. "
+         "run_code_check performs bounded code analysis, config validation, and "
+         "derived-value computation. "
+         "inspect_image extracts information from figures, tables, charts, and "
+         "screenshots. "
+         "Rules: use tools only to support or verify claims within the current "
+         "scenario constraints. Tools do not override constraints, loosen limits, "
+         "or reveal hidden ground truth. No unrestricted web browsing. No audio "
+         "capabilities. No autonomous code execution beyond bounded analysis."
+     ),
  ]

  return "\n\n".join(section for section in sections if section)
@@ -344,7 +356,7 @@ def format_scientist_observation(obs: ScientistObservation) -> str:
  def build_baseline_scientist_action(
      observation: ScientistObservation,
  ) -> ScientistAction:
-     """Return a deterministic non-LLM Scientist action for smoke tests.
+     """Return a deterministic baseline Scientist action for smoke tests.

      The baseline follows a conservative policy:
      - propose a valid protocol when no protocol exists yet
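The diff above appends a new entry to the `sections` list that `build_scientist_system_prompt` joins; the join itself drops empty entries and separates the rest with one blank line. A minimal standalone sketch of that joining behavior (the helper name `join_sections` is ours, not the repo's):

```python
def join_sections(sections):
    # Mirrors the builder's final line: falsy sections (empty strings,
    # None) are dropped, the rest are separated by one blank line.
    return "\n\n".join(section for section in sections if section)

prompt = join_sections([
    "You are the Scientist.",
    "",  # skipped: empty section
    "Bounded tool policy: use tools only to verify claims.",
])
print(prompt)
# You are the Scientist.
#
# Bounded tool policy: use tools only to verify claims.
```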
replicalab/client.py ADDED
@@ -0,0 +1,333 @@
+ """Reusable environment client for ReplicaLab (TRN 13).
+
+ Wraps both REST and WebSocket server transports behind a unified
+ sync interface. Consumers (notebook, training loop, eval scripts)
+ import this module instead of duplicating connection logic.
+
+ Usage::
+
+     from replicalab.client import ReplicaLabClient
+
+     with ReplicaLabClient("http://localhost:7860", transport="websocket") as client:
+         obs = client.reset(seed=42, scenario="math_reasoning", difficulty="easy")
+         while True:
+             action = policy(obs)
+             result = client.step(action)
+             obs = result.observation
+             if result.done:
+                 break
+ """
+
+ from __future__ import annotations
+
+ import json
+ import threading
+ from typing import Optional
+
+ import httpx
+ import websocket as ws_lib  # websocket-client
+
+ from replicalab.config import (
+     API_PORT,
+     DEFAULT_DIFFICULTY,
+     DEFAULT_SCENARIO_TEMPLATE,
+ )
+ from replicalab.models import (
+     EpisodeLog,
+     Observation,
+     ScientistAction,
+     StepInfo,
+     StepResult,
+ )
+
+ __all__ = ["ReplicaLabClient"]
+
+ # ---------------------------------------------------------------------------
+ # Transport backends
+ # ---------------------------------------------------------------------------
+
+
+ class _RestTransport:
+     """Sync REST transport using httpx."""
+
+     def __init__(self, base_url: str, timeout: float) -> None:
+         self._base_url = base_url.rstrip("/")
+         self._http = httpx.Client(base_url=self._base_url, timeout=timeout)
+         self._session_id: Optional[str] = None
+         self._episode_id: Optional[str] = None
+
+     # -- lifecycle -----------------------------------------------------------
+
+     def connect(self) -> None:
+         resp = self._http.get("/health")
+         resp.raise_for_status()
+
+     def close(self) -> None:
+         self._session_id = None
+         self._episode_id = None
+         self._http.close()
+
+     # -- env operations ------------------------------------------------------
+
+     def reset(
+         self,
+         seed: int,
+         scenario: str,
+         difficulty: str,
+     ) -> Observation:
+         payload: dict = {
+             "seed": seed,
+             "scenario": scenario,
+             "difficulty": difficulty,
+         }
+         if self._session_id is not None:
+             payload["session_id"] = self._session_id
+
+         resp = self._http.post("/reset", json=payload)
+         resp.raise_for_status()
+         data = resp.json()
+         self._session_id = data["session_id"]
+         self._episode_id = data["episode_id"]
+         return Observation.model_validate(data["observation"])
+
+     def step(self, action: ScientistAction) -> StepResult:
+         if self._session_id is None:
+             raise RuntimeError("Call reset() before step()")
+         resp = self._http.post(
+             "/step",
+             json={
+                 "session_id": self._session_id,
+                 "action": action.model_dump(),
+             },
+         )
+         resp.raise_for_status()
+         return StepResult.model_validate(resp.json())
+
+     def state(self) -> dict:
+         if self._session_id is None:
+             raise RuntimeError("Call reset() before state()")
+         resp = self._http.get(f"/state/{self._session_id}")
+         resp.raise_for_status()
+         return resp.json()
+
+     def replay(self, episode_id: str) -> EpisodeLog:
+         resp = self._http.get(f"/replay/{episode_id}")
+         resp.raise_for_status()
+         return EpisodeLog.model_validate(resp.json())
+
+     # -- properties ----------------------------------------------------------
+
+     @property
+     def session_id(self) -> Optional[str]:
+         return self._session_id
+
+     @property
+     def episode_id(self) -> Optional[str]:
+         return self._episode_id
+
+
+ class _WsTransport:
+     """Sync WebSocket transport using websocket-client."""
+
+     def __init__(self, base_url: str, timeout: float) -> None:
+         # Convert http(s):// → ws(s)://
+         ws_url = base_url.rstrip("/")
+         ws_url = ws_url.replace("https://", "wss://").replace("http://", "ws://")
+         self._ws_url = ws_url + "/ws"
+         self._timeout = timeout
+         self._ws: Optional[ws_lib.WebSocket] = None
+         self._episode_id: Optional[str] = None
+         self._lock = threading.Lock()
+
+     # -- lifecycle -----------------------------------------------------------
+
+     def connect(self) -> None:
+         self._ws = ws_lib.create_connection(
+             self._ws_url, timeout=self._timeout
+         )
+
+     def close(self) -> None:
+         if self._ws is not None:
+             try:
+                 self._ws.close()
+             except Exception:
+                 pass
+         self._ws = None
+         self._episode_id = None
+
+     # -- low-level send/recv -------------------------------------------------
+
+     def _send(self, payload: dict) -> dict:
+         if self._ws is None:
+             raise RuntimeError("Call connect() before sending messages")
+         with self._lock:
+             self._ws.send(json.dumps(payload))
+             raw = self._ws.recv()
+         data = json.loads(raw)
+         if data.get("type") == "error":
+             raise RuntimeError(f"Server error: {data.get('message', '')}")
+         return data
+
+     # -- env operations ------------------------------------------------------
+
+     def reset(
+         self,
+         seed: int,
+         scenario: str,
+         difficulty: str,
+     ) -> Observation:
+         data = self._send({
+             "type": "reset",
+             "seed": seed,
+             "scenario": scenario,
+             "difficulty": difficulty,
+         })
+         if data.get("type") != "reset_ok":
+             raise RuntimeError(f"Unexpected response type: {data.get('type')}")
+         self._episode_id = data.get("episode_id")
+         return Observation.model_validate(data["observation"])
+
+     def step(self, action: ScientistAction) -> StepResult:
+         data = self._send({
+             "type": "step",
+             "action": action.model_dump(),
+         })
+         if data.get("type") != "step_ok":
+             raise RuntimeError(f"Unexpected response type: {data.get('type')}")
+         return StepResult(
+             observation=Observation.model_validate(data["observation"])
+             if data.get("observation")
+             else None,
+             reward=data.get("reward", 0.0),
+             done=data.get("done", False),
+             info=StepInfo.model_validate(data.get("info", {})),
+         )
+
+     def state(self) -> dict:
+         raise NotImplementedError(
+             "state() is not available over WebSocket. Use REST transport or "
+             "track state from step() results."
+         )
+
+     def replay(self, episode_id: str) -> EpisodeLog:
+         raise NotImplementedError(
+             "replay() is not available over WebSocket. Use REST transport or "
+             "a separate httpx call to GET /replay/{episode_id}."
+         )
+
+     # -- properties ----------------------------------------------------------
+
+     @property
+     def session_id(self) -> Optional[str]:
+         return None  # WS sessions are implicit per-connection
+
+     @property
+     def episode_id(self) -> Optional[str]:
+         return self._episode_id
+
+
+ # ---------------------------------------------------------------------------
+ # Public client
+ # ---------------------------------------------------------------------------
+
+
+ class ReplicaLabClient:
+     """Reusable sync client for the ReplicaLab environment server.
+
+     Parameters
+     ----------
+     base_url:
+         Server URL, e.g. ``"http://localhost:7860"``.
+     transport:
+         ``"websocket"`` (default) or ``"rest"``.
+     timeout:
+         Request/connection timeout in seconds.
+     """
+
+     def __init__(
+         self,
+         base_url: str = f"http://localhost:{API_PORT}",
+         *,
+         transport: str = "websocket",
+         timeout: float = 30.0,
+     ) -> None:
+         if transport == "websocket":
+             self._transport: _RestTransport | _WsTransport = _WsTransport(
+                 base_url, timeout
+             )
+         elif transport == "rest":
+             self._transport = _RestTransport(base_url, timeout)
+         else:
+             raise ValueError(f"Unknown transport: {transport!r}. Use 'websocket' or 'rest'.")
+         self._connected = False
+
+     # -- context manager -----------------------------------------------------
+
+     def __enter__(self) -> "ReplicaLabClient":
+         self.connect()
+         return self
+
+     def __exit__(self, *exc) -> None:
+         self.close()
+
+     # -- lifecycle -----------------------------------------------------------
+
+     def connect(self) -> None:
+         """Open the connection to the server."""
+         self._transport.connect()
+         self._connected = True
+
+     def close(self) -> None:
+         """Close the connection and release resources."""
+         self._transport.close()
+         self._connected = False
+
+     # -- env operations ------------------------------------------------------
+
+     def reset(
+         self,
+         seed: int = 0,
+         scenario: str = DEFAULT_SCENARIO_TEMPLATE,
+         difficulty: str = DEFAULT_DIFFICULTY,
+     ) -> Observation:
+         """Start a new episode. Returns the initial observation."""
+         self._ensure_connected()
+         return self._transport.reset(seed, scenario, difficulty)
+
+     def step(self, action: ScientistAction) -> StepResult:
+         """Submit a Scientist action. Returns the step result."""
+         self._ensure_connected()
+         return self._transport.step(action)
+
+     def state(self) -> dict:
+         """Get current episode state (REST only)."""
+         self._ensure_connected()
+         return self._transport.state()
+
+     def replay(self, episode_id: str) -> EpisodeLog:
+         """Fetch a completed episode log (REST only)."""
+         self._ensure_connected()
+         return self._transport.replay(episode_id)
+
+     # -- properties ----------------------------------------------------------
+
+     @property
+     def session_id(self) -> Optional[str]:
+         """REST session ID, or ``None`` for WebSocket transport."""
+         return self._transport.session_id
+
+     @property
+     def episode_id(self) -> Optional[str]:
+         """Current episode ID set after the most recent ``reset()``."""
+         return self._transport.episode_id
+
+     @property
+     def connected(self) -> bool:
+         """Whether ``connect()`` has been called."""
+         return self._connected
+
+     # -- internal ------------------------------------------------------------
+
+     def _ensure_connected(self) -> None:
+         if not self._connected:
+             raise RuntimeError("Client not connected. Call connect() or use as context manager.")
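`_WsTransport.__init__` derives its WebSocket endpoint by rewriting the HTTP scheme and appending `/ws`. That rewrite is pure string manipulation and easy to check in isolation; a standalone sketch of the same conversion (the function name `to_ws_url` is ours, not the repo's):

```python
def to_ws_url(base_url: str) -> str:
    # Mirror of the conversion in _WsTransport.__init__: strip a trailing
    # slash, swap http(s):// for ws(s)://, then append the /ws route.
    url = base_url.rstrip("/")
    url = url.replace("https://", "wss://").replace("http://", "ws://")
    return url + "/ws"

print(to_ws_url("http://localhost:7860"))      # ws://localhost:7860/ws
print(to_ws_url("https://example.hf.space/"))  # wss://example.hf.space/ws
```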
replicalab/scoring/__init__.py CHANGED
@@ -3,8 +3,11 @@
  from .feasibility import score_feasibility
  from .fidelity import score_fidelity
  from .rigor import score_rigor
+ from .rubric import build_reward_breakdown, compute_total_reward

  __all__ = [
+     "build_reward_breakdown",
+     "compute_total_reward",
      "score_feasibility",
      "score_fidelity",
      "score_rigor",
replicalab/scoring/rubric.py ADDED
@@ -0,0 +1,99 @@
+ """JDG 04-05 — Total reward computation and reward breakdown builder.
+
+ Combines rigor (JDG 01), feasibility (JDG 02), and fidelity (JDG 03)
+ into a single scalar reward with efficiency bonus and penalties.
+
+ Formula: total = 10 × rigor × feasibility × fidelity + bonuses − penalties
+
+ Pure deterministic functions — no model calls, no side effects.
+ """
+
+ from __future__ import annotations
+
+ from replicalab.agents.lab_manager_policy import (
+     FeasibilityCheckResult,
+     check_feasibility,
+ )
+ from replicalab.models import Protocol, RewardBreakdown
+ from replicalab.scenarios.templates import NormalizedScenarioPack
+ from replicalab.scoring.feasibility import score_feasibility
+ from replicalab.scoring.fidelity import score_fidelity
+ from replicalab.scoring.rigor import score_rigor
+
+
+ _REWARD_SCALE = 10.0
+ _MAX_EFFICIENCY_BONUS = 1.0
+ _MAX_COMMUNICATION_BONUS = 0.0  # reserved for future use
+
+
+ def compute_total_reward(breakdown: RewardBreakdown) -> float:
+     """Compute the scalar reward from a RewardBreakdown.
+
+     Formula: 10 × rigor × feasibility × fidelity + efficiency_bonus
+     + communication_bonus − sum(penalties)
+     """
+     base = _REWARD_SCALE * breakdown.rigor * breakdown.feasibility * breakdown.fidelity
+     bonus = breakdown.efficiency_bonus + breakdown.communication_bonus
+     penalty = sum(breakdown.penalties.values())
+     return max(0.0, round(base + bonus - penalty, 6))
+
+
+ def build_reward_breakdown(
+     protocol: Protocol,
+     scenario: NormalizedScenarioPack,
+     rounds_used: int,
+     max_rounds: int,
+     *,
+     check: FeasibilityCheckResult | None = None,
+     penalties: dict[str, float] | None = None,
+ ) -> RewardBreakdown:
+     """Build a full RewardBreakdown from the three sub-scores plus bonuses.
+
+     Parameters
+     ----------
+     protocol : Protocol
+         The final agreed protocol.
+     scenario : NormalizedScenarioPack
+         The scenario pack for this episode.
+     rounds_used : int
+         How many rounds were consumed.
+     max_rounds : int
+         The episode's round cap.
+     check : FeasibilityCheckResult, optional
+         Pre-computed feasibility check to avoid redundant work.
+     penalties : dict[str, float], optional
+         Named penalty keys for bounded-tool diagnostics, unsupported
+         evidence claims, or other deterministic deductions. Use named
+         keys (e.g. ``"invalid_tool_use"``, ``"unsupported_claim"``)
+         instead of adding new fields to RewardBreakdown.
+     """
+     if check is None:
+         check = check_feasibility(protocol, scenario)
+
+     rigor = score_rigor(protocol, scenario)
+     feasibility = score_feasibility(protocol, scenario, check=check)
+     fidelity = score_fidelity(protocol, scenario)
+
+     efficiency_bonus = _efficiency_bonus(rounds_used, max_rounds)
+     merged_penalties = dict(penalties) if penalties else {}
+
+     return RewardBreakdown(
+         rigor=rigor,
+         feasibility=feasibility,
+         fidelity=fidelity,
+         efficiency_bonus=efficiency_bonus,
+         communication_bonus=0.0,
+         penalties=merged_penalties,
+     )
+
+
+ def _efficiency_bonus(rounds_used: int, max_rounds: int) -> float:
+     """Reward finishing in fewer rounds.
+
+     If the scientist reaches agreement in round 1 of 6, that's the maximum
+     bonus. If they use all rounds, the bonus is 0.
+     """
+     if max_rounds <= 1 or rounds_used <= 0:
+         return 0.0
+     saved = max(0, max_rounds - rounds_used)
+     return round(_MAX_EFFICIENCY_BONUS * saved / (max_rounds - 1), 6)
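Since the rubric is pure and deterministic, its arithmetic can be checked without the Pydantic models. A standalone sketch of the same formulas, using plain floats instead of `RewardBreakdown` (the function names here are ours):

```python
# Standalone sketch of the rubric from rubric.py:
# total = 10 * rigor * feasibility * fidelity + bonuses - penalties,
# clamped at zero and rounded to 6 decimal places.
def total_reward(rigor, feasibility, fidelity,
                 efficiency_bonus=0.0, communication_bonus=0.0,
                 penalties=None):
    base = 10.0 * rigor * feasibility * fidelity
    bonus = efficiency_bonus + communication_bonus
    penalty = sum((penalties or {}).values())
    return max(0.0, round(base + bonus - penalty, 6))

def efficiency_bonus(rounds_used, max_rounds, max_bonus=1.0):
    # Finishing in round 1 of N earns the full bonus; using every round earns 0.
    if max_rounds <= 1 or rounds_used <= 0:
        return 0.0
    saved = max(0, max_rounds - rounds_used)
    return round(max_bonus * saved / (max_rounds - 1), 6)

# Agreement in round 2 of 6 with strong sub-scores and one named penalty:
bonus = efficiency_bonus(rounds_used=2, max_rounds=6)   # 0.8
reward = total_reward(0.9, 0.8, 0.7, efficiency_bonus=bonus,
                      penalties={"invalid_tool_use": 0.5})
print(bonus, reward)  # 0.8 5.34
```

Note that the multiplicative base means any zero sub-score zeroes the whole base, so bonuses alone can never rescue a plan that fails one axis outright.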
replicalab/utils/logging.py ADDED
@@ -0,0 +1,159 @@
+ """Episode logging and replay persistence helpers.
+
+ MOD 07 provides the persistence boundary for episode replay, notebook
+ inspection, and later API replay retrieval. All writes are atomic
+ (temp file + rename) so a crash never leaves a half-written replay.
+ """
+
+ from __future__ import annotations
+
+ import csv
+ import io
+ import os
+ import tempfile
+ from pathlib import Path
+ from typing import TypeVar
+
+ from pydantic import BaseModel
+
+ from replicalab.models import EpisodeLog
+
+ _M = TypeVar("_M", bound=BaseModel)
+
+ _DEFAULT_REPLAYS_DIR = Path(__file__).resolve().parents[2] / "replicalab" / "outputs" / "replays"
+ _DEFAULT_LOGS_DIR = Path(__file__).resolve().parents[2] / "replicalab" / "outputs" / "logs"
+
+
+ # ---------------------------------------------------------------------------
+ # Internal helper — atomic JSON write for any Pydantic model
+ # ---------------------------------------------------------------------------
+
+
+ def _write_json_model(model: BaseModel, path: Path) -> Path:
+     """Serialize a Pydantic model to *path* atomically.
+
+     Writes to a temporary file in the same directory, then renames so
+     readers never see a partial file.
+     """
+     path = Path(path)
+     path.parent.mkdir(parents=True, exist_ok=True)
+
+     data = model.model_dump_json(indent=2)
+
+     fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
+     try:
+         with os.fdopen(fd, "w", encoding="utf-8") as fh:
+             fh.write(data)
+         # On Windows, target must not exist for os.rename; use replace.
+         os.replace(tmp, str(path))
+     except BaseException:
+         # Clean up the temp file on any failure.
+         try:
+             os.unlink(tmp)
+         except OSError:
+             pass
+         raise
+
+     return path
+
+
+ # ---------------------------------------------------------------------------
+ # Public API
+ # ---------------------------------------------------------------------------
+
+
+ def write_episode_log(
+     log: EpisodeLog,
+     directory: Path | str | None = None,
+ ) -> Path:
+     """Persist a completed episode log as JSON.
+
+     Parameters
+     ----------
+     log:
+         The completed episode record.
+     directory:
+         Target directory. Defaults to ``replicalab/outputs/replays/``.
+
+     Returns
+     -------
+     Path
+         Absolute path to the written file.
+     """
+     directory = Path(directory) if directory is not None else _DEFAULT_REPLAYS_DIR
+     filename = f"{log.episode_id}.json" if log.episode_id else "unknown.json"
+     return _write_json_model(log, directory / filename)
+
+
+ def load_episode_log(path: Path | str) -> EpisodeLog:
+     """Load an episode log from a JSON file.
+
+     Raises
+     ------
+     FileNotFoundError
+         If *path* does not exist.
+     pydantic.ValidationError
+         If the file contents do not match the ``EpisodeLog`` schema.
+     """
+     path = Path(path)
+     raw = path.read_text(encoding="utf-8")
+     return EpisodeLog.model_validate_json(raw)
+
+
+ def append_reward_csv(
+     path: Path | str | None = None,
+     *,
+     episode_id: str = "",
+     seed: int = 0,
+     scenario_template: str = "",
+     difficulty: str = "",
+     total_reward: float = 0.0,
+     rigor: float = 0.0,
+     feasibility: float = 0.0,
+     fidelity: float = 0.0,
+     rounds_used: int = 0,
+     agreement_reached: bool = False,
+ ) -> Path:
+     """Append one row to a reward CSV file.
+
+     Creates the file with a header if it does not exist.
+     Pre-stages the format that JDG 07 will consume.
+     """
+     path = Path(path) if path is not None else _DEFAULT_LOGS_DIR / "rewards.csv"
+     path.parent.mkdir(parents=True, exist_ok=True)
+
+     fieldnames = [
+         "episode_id",
+         "seed",
+         "scenario_template",
+         "difficulty",
+         "total_reward",
+         "rigor",
+         "feasibility",
+         "fidelity",
+         "rounds_used",
+         "agreement_reached",
+     ]
+
+     write_header = not path.exists() or path.stat().st_size == 0
+
+     row = {
+         "episode_id": episode_id,
+         "seed": seed,
+         "scenario_template": scenario_template,
+         "difficulty": difficulty,
+         "total_reward": total_reward,
+         "rigor": rigor,
+         "feasibility": feasibility,
+         "fidelity": fidelity,
+         "rounds_used": rounds_used,
+         "agreement_reached": agreement_reached,
+     }
+
+     with open(path, "a", newline="", encoding="utf-8") as fh:
+         writer = csv.DictWriter(fh, fieldnames=fieldnames)
+         if write_header:
+             writer.writeheader()
+         writer.writerow(row)
+
+     return path
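The temp-file-plus-`os.replace` pattern in `_write_json_model` is independent of Pydantic; the same sketch works for any string payload. A minimal standalone version (the function name `atomic_write_text` is ours):

```python
import os
import tempfile
from pathlib import Path

def atomic_write_text(text: str, path: Path) -> Path:
    # Write to a temp file in the same directory, then rename over the
    # target: readers see either the old file or the new one, never a
    # partial write. os.replace also works when the target already
    # exists, unlike os.rename on Windows.
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
        os.replace(tmp, str(path))
    except BaseException:
        # Remove the orphaned temp file on any failure, then re-raise.
        try:
            os.unlink(tmp)
        except OSError:
            pass
        raise
    return path

with tempfile.TemporaryDirectory() as d:
    target = atomic_write_text('{"episode_id": "ep-1"}', Path(d) / "ep-1.json")
    print(target.read_text(encoding="utf-8"))  # {"episode_id": "ep-1"}
```

Creating the temp file in the target directory (not the system temp dir) matters: `os.replace` is only atomic when source and destination live on the same filesystem.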
server/Dockerfile CHANGED
@@ -17,8 +17,8 @@ COPY replicalab/ ./replicalab/
  COPY server/ ./server/
  COPY pyproject.toml ./

- # Install the replicalab package in editable mode
- RUN pip install --no-cache-dir -e . --no-deps
+ # Install the replicalab package (non-editable, deps already present)
+ RUN pip install --no-cache-dir . --no-deps

  # Run as a non-root user inside the container
  RUN useradd -m -u 1000 appuser && chown -R appuser /app
server/app.py CHANGED
@@ -76,7 +76,7 @@ logging.basicConfig(
 log = logging.getLogger("replicalab.server")
 
 # ---------------------------------------------------------------------------
-# Environment factory — swap _StubEnv for ReplicaLabEnv once Person A ships it
+# Environment factory — prefer ReplicaLabEnv, retain _StubEnv only as fallback
 # ---------------------------------------------------------------------------
 
 try:
@@ -89,21 +89,17 @@ except ImportError:
     log.warning("ReplicaLabEnv not found — using _StubEnv (replace when Person A ships env)")
 
 
-def _reward_breakdown_from_state(state: EpisodeState) -> RewardBreakdown:
-    return RewardBreakdown(
-        rigor=state.rigor_score,
-        feasibility=state.feasibility_score,
-        fidelity=state.fidelity_score,
-        efficiency_bonus=0.0,
-        communication_bonus=0.0,
-        penalties={
-            "invalid_action": 0.0,
-            "timeout": 0.0,
-        },
-    )
-
-
-def _build_episode_log(episode_id: str, state: EpisodeState) -> EpisodeLog:
+def _build_episode_log(
+    episode_id: str,
+    state: EpisodeState,
+    result: StepResult,
+) -> EpisodeLog:
+    """Build an EpisodeLog from the terminal StepResult.
+
+    Uses the real reward_breakdown, judge_notes, and verdict from the env
+    instead of rebuilding from state with stale stub values.
+    """
+    info = result.info
     return EpisodeLog(
         episode_id=episode_id,
         seed=state.seed,
@@ -111,12 +107,12 @@ def _build_episode_log(episode_id: str, state: EpisodeState) -> EpisodeLog:
         difficulty=state.difficulty,
         final_state=state,
         transcript=list(state.conversation_history),
-        reward_breakdown=_reward_breakdown_from_state(state),
-        total_reward=state.reward,
+        reward_breakdown=info.reward_breakdown,
+        total_reward=result.reward,
         rounds_used=state.round_number,
-        agreement_reached=state.agreement_reached,
-        judge_notes="Stub audit until judge integration lands.",
-        verdict="accept" if state.agreement_reached else "revise",
+        agreement_reached=info.agreement_reached,
+        judge_notes=info.judge_notes or "",
+        verdict=info.verdict or "",
     )
@@ -198,7 +194,11 @@ class _StubEnv:
             info=StepInfo(
                 agreement_reached=self._state.agreement_reached,
                 error=None,
-                reward_breakdown=_reward_breakdown_from_state(self._state) if done else None,
+                reward_breakdown=RewardBreakdown(
+                    rigor=self._state.rigor_score,
+                    feasibility=self._state.feasibility_score,
+                    fidelity=self._state.fidelity_score,
+                ) if done else None,
                 judge_notes="Stub audit until judge integration lands." if done else None,
                 verdict=("accept" if self._state.agreement_reached else "revise") if done else None,
                 round=self._state.round_number,
@@ -416,6 +416,10 @@ class ResetResponse(BaseModel):
     observation: Observation
 
 
+class ScenariosResponse(BaseModel):
+    scenarios: list[dict]
+
+
 class StepRequest(BaseModel):
     session_id: str
     action: ScientistAction
@@ -431,9 +435,9 @@ async def health():
     return {"status": "ok", "env": "real" if _HAS_REAL_ENV else "stub"}
 
 
-@app.get("/scenarios")
+@app.get("/scenarios", response_model=ScenariosResponse)
 async def list_scenarios():
-    return {"scenarios": SCENARIOS}
+    return ScenariosResponse(scenarios=SCENARIOS)
 
 
 @app.post("/reset", response_model=ResetResponse)
@@ -478,6 +482,7 @@ async def step_episode(req: StepRequest):
         _replay_store[session["episode_id"]] = _build_episode_log(
             session["episode_id"],
             state,
+            result,
         )
         log.info(
             "Episode done | session=%s episode=%s reward=%.2f",
@@ -596,7 +601,9 @@ async def websocket_endpoint(ws: WebSocket):
     # Store completed episode for REST replay
     if result.done and episode_id:
         state = env.state()
-        _replay_store[episode_id] = _build_episode_log(episode_id, state)
+        _replay_store[episode_id] = _build_episode_log(
+            episode_id, state, result
+        )
 
     await _ws_send(
         ws,
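The `_build_episode_log` refactor above switches the log's reward fields from stale state-derived stubs to the terminal `StepResult`. A minimal stdlib sketch of that pattern, using hypothetical simplified types (not the real Pydantic models):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StepInfo:
    agreement_reached: bool
    judge_notes: Optional[str] = None
    verdict: Optional[str] = None

@dataclass
class StepResult:
    reward: float
    done: bool
    info: StepInfo

def build_episode_log(episode_id: str, result: StepResult) -> dict:
    # Pull judge output from the terminal result, not from stale state.
    info = result.info
    return {
        "episode_id": episode_id,
        "total_reward": result.reward,
        "agreement_reached": info.agreement_reached,
        "judge_notes": info.judge_notes or "",
        "verdict": info.verdict or "",
    }

log = build_episode_log(
    "ep-1",
    StepResult(reward=5.0, done=True,
               info=StepInfo(agreement_reached=True, verdict="accept")),
)
print(log["verdict"])  # accept
```

The `or ""` fallbacks mirror the diff's handling of `None` judge fields on non-terminal results.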
server/requirements.txt CHANGED
@@ -2,3 +2,5 @@ fastapi>=0.115,<1.0
 uvicorn[standard]>=0.34,<1.0
 websockets>=15.0,<17.0
 pydantic>=2.7,<3.0
+httpx>=0.27,<1.0
+websocket-client>=1.7,<2.0
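The two new dependencies back the client's dual transports: `httpx` for the REST endpoints and `websocket-client` for the WS path. A tiny illustrative sketch of the URL split (the `/health`, `/reset`, and `/step` paths come from the server diff; the `/ws` path is a hypothetical placeholder):

```python
def endpoint_urls(host: str, port: int) -> dict:
    """Build REST and WS endpoint URLs for a ReplicaLab server."""
    base = f"http://{host}:{port}"
    return {
        "health": f"{base}/health",   # GET, served by the health() route
        "reset": f"{base}/reset",     # POST ResetRequest
        "step": f"{base}/step",       # POST StepRequest
        "ws": f"ws://{host}:{port}/ws",  # hypothetical WS path
    }

urls = endpoint_urls("127.0.0.1", 7860)
print(urls["health"])  # http://127.0.0.1:7860/health
```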
tests/fixtures/api_schema_examples.json ADDED
@@ -0,0 +1,491 @@
+{
+  "_meta": {
+    "generated_by": "tests/fixtures/generate_api_examples.py",
+    "description": "API schema examples generated from real Pydantic models. Re-run the script to regenerate after contract changes.",
+    "seed": 42,
+    "scenario_template": "math_reasoning",
+    "difficulty": "easy"
+  },
+  "rest": {
+    "POST /reset": {
+      "request": {
+        "seed": 42,
+        "scenario": "math_reasoning",
+        "difficulty": "easy",
+        "session_id": null
+      },
+      "response": {
+        "session_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
+        "episode_id": "ep-deadbeef-1234-5678-9abc-def012345678",
+        "observation": {
+          "scientist": {
+            "paper_title": "Planning a proof of the Cauchy-Schwarz inequality",
+            "paper_hypothesis": "A square-expansion argument gives the cleanest proof path.",
+            "paper_method": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+            "paper_key_finding": "The proof is accepted only if every inequality step and equality case is justified.",
+            "experiment_goal": "Produce a proof-planning workflow for the Cauchy-Schwarz inequality for an undergraduate seminar handout.",
+            "conversation_history": [],
+            "current_protocol": null,
+            "round_number": 0,
+            "max_rounds": 6
+          },
+          "lab_manager": {
+            "budget_total": 345.0,
+            "budget_remaining": 345.0,
+            "equipment_available": [
+              "Structured proof notebook"
+            ],
+            "equipment_booked": [],
+            "reagents_in_stock": [
+              "Reference theorem library",
+              "Graduate reviewer"
+            ],
+            "reagents_out_of_stock": [],
+            "staff_count": 1,
+            "time_limit_days": 3,
+            "safety_restrictions": [
+              "The outline should stay concise enough for seminar notes."
+            ],
+            "conversation_history": [],
+            "current_protocol": null,
+            "round_number": 0,
+            "max_rounds": 6
+          }
+        }
+      }
+    },
+    "POST /step": {
+      "request": {
+        "session_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
+        "action": {
+          "action_type": "propose_protocol",
+          "sample_size": 30,
+          "controls": [
+            "positive_control",
+            "negative_control"
+          ],
+          "technique": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+          "duration_days": 5,
+          "required_equipment": [
+            "Structured proof notebook"
+          ],
+          "required_reagents": [
+            "Reference theorem library",
+            "Graduate reviewer"
+          ],
+          "questions": [],
+          "rationale": "Initial proposal using available resources."
+        }
+      },
+      "response_mid_episode": {
+        "observation": {
+          "scientist": {
+            "paper_title": "Planning a proof of the Cauchy-Schwarz inequality",
+            "paper_hypothesis": "A square-expansion argument gives the cleanest proof path.",
+            "paper_method": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+            "paper_key_finding": "The proof is accepted only if every inequality step and equality case is justified.",
+            "experiment_goal": "Produce a proof-planning workflow for the Cauchy-Schwarz inequality for an undergraduate seminar handout.",
+            "conversation_history": [
+              {
+                "role": "scientist",
+                "message": "Initial proposal using available resources.",
+                "round_number": 1,
+                "action_type": "propose_protocol"
+              },
+              {
+                "role": "lab_manager",
+                "message": "Budget is within range. Equipment is available.",
+                "round_number": 1,
+                "action_type": "report_feasibility"
+              }
+            ],
+            "current_protocol": {
+              "sample_size": 30,
+              "controls": [
+                "positive_control",
+                "negative_control"
+              ],
+              "technique": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+              "duration_days": 5,
+              "required_equipment": [
+                "Structured proof notebook"
+              ],
+              "required_reagents": [
+                "Reference theorem library",
+                "Graduate reviewer"
+              ],
+              "rationale": "Initial proposal using available resources."
+            },
+            "round_number": 1,
+            "max_rounds": 6
+          },
+          "lab_manager": {
+            "budget_total": 345.0,
+            "budget_remaining": 345.0,
+            "equipment_available": [
+              "Structured proof notebook"
+            ],
+            "equipment_booked": [],
+            "reagents_in_stock": [
+              "Reference theorem library",
+              "Graduate reviewer"
+            ],
+            "reagents_out_of_stock": [],
+            "staff_count": 1,
+            "time_limit_days": 3,
+            "safety_restrictions": [
+              "The outline should stay concise enough for seminar notes."
+            ],
+            "conversation_history": [
+              {
+                "role": "scientist",
+                "message": "Initial proposal using available resources.",
+                "round_number": 1,
+                "action_type": "propose_protocol"
+              },
+              {
+                "role": "lab_manager",
+                "message": "Budget is within range. Equipment is available.",
+                "round_number": 1,
+                "action_type": "report_feasibility"
+              }
+            ],
+            "current_protocol": {
+              "sample_size": 30,
+              "controls": [
+                "positive_control",
+                "negative_control"
+              ],
+              "technique": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+              "duration_days": 5,
+              "required_equipment": [
+                "Structured proof notebook"
+              ],
+              "required_reagents": [
+                "Reference theorem library",
+                "Graduate reviewer"
+              ],
+              "rationale": "Initial proposal using available resources."
+            },
+            "round_number": 1,
+            "max_rounds": 6
+          }
+        },
+        "reward": 0.0,
+        "done": false,
+        "info": {
+          "agreement_reached": false,
+          "error": null,
+          "reward_breakdown": null,
+          "judge_notes": null,
+          "verdict": null
+        }
+      },
+      "response_terminal": {
+        "observation": null,
+        "reward": 5.0,
+        "done": true,
+        "info": {
+          "agreement_reached": true,
+          "error": null,
+          "reward_breakdown": {
+            "rigor": 0.8,
+            "feasibility": 0.8,
+            "fidelity": 0.8,
+            "efficiency_bonus": 0.2,
+            "communication_bonus": 0.1,
+            "penalties": {
+              "timeout": 0.0
+            }
+          },
+          "judge_notes": "Stub audit until judge integration lands.",
+          "verdict": "accept"
+        }
+      }
+    },
+    "GET /scenarios": {
+      "response": {
+        "scenarios": [
+          {
+            "family": "math_reasoning",
+            "difficulties": [
+              "easy",
+              "medium",
+              "hard"
+            ]
+          },
+          {
+            "family": "ml_benchmark",
+            "difficulties": [
+              "easy",
+              "medium",
+              "hard"
+            ]
+          },
+          {
+            "family": "finance_trading",
+            "difficulties": [
+              "easy",
+              "medium",
+              "hard"
+            ]
+          }
+        ]
+      }
+    },
+    "GET /replay/{episode_id}": {
+      "response": {
+        "episode_id": "ep-deadbeef-1234-5678-9abc-def012345678",
+        "seed": 42,
+        "scenario_template": "math_reasoning",
+        "difficulty": "easy",
+        "final_state": {
+          "seed": 42,
+          "scenario_template": "math_reasoning",
+          "difficulty": "easy",
+          "paper_title": "Planning a proof of the Cauchy-Schwarz inequality",
+          "paper_hypothesis": "A square-expansion argument gives the cleanest proof path.",
+          "paper_method": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+          "paper_key_finding": "The proof is accepted only if every inequality step and equality case is justified.",
+          "experiment_goal": "Produce a proof-planning workflow for the Cauchy-Schwarz inequality for an undergraduate seminar handout.",
+          "lab_budget_total": 345.0,
+          "lab_budget_remaining": 345.0,
+          "lab_equipment": [
+            "Structured proof notebook"
+          ],
+          "lab_reagents": [
+            "Reference theorem library",
+            "Graduate reviewer"
+          ],
+          "lab_staff_count": 1,
+          "lab_time_limit_days": 3,
+          "current_protocol": null,
+          "conversation_history": [],
+          "round_number": 3,
+          "max_rounds": 6,
+          "done": true,
+          "agreement_reached": true,
+          "reward": 5.0,
+          "rigor_score": 0.8,
+          "feasibility_score": 0.8,
+          "fidelity_score": 0.8
+        },
+        "transcript": [
+          {
+            "role": "scientist",
+            "message": "Initial proposal using available resources.",
+            "round_number": 1,
+            "action_type": "propose_protocol"
+          },
+          {
+            "role": "lab_manager",
+            "message": "Budget is within range. Equipment is available.",
+            "round_number": 1,
+            "action_type": "report_feasibility"
+          }
+        ],
+        "reward_breakdown": {
+          "rigor": 0.8,
+          "feasibility": 0.8,
+          "fidelity": 0.8,
+          "efficiency_bonus": 0.2,
+          "communication_bonus": 0.1,
+          "penalties": {
+            "timeout": 0.0
+          }
+        },
+        "total_reward": 5.0,
+        "rounds_used": 3,
+        "agreement_reached": true,
+        "judge_notes": "Stub audit until judge integration lands.",
+        "verdict": "accept"
+      }
+    }
+  },
+  "websocket": {
+    "reset": {
+      "client_sends": {
+        "type": "reset",
+        "seed": 42,
+        "scenario": "math_reasoning",
+        "difficulty": "easy"
+      },
+      "server_responds": {
+        "type": "reset_ok",
+        "episode_id": "ep-deadbeef-1234-5678-9abc-def012345678",
+        "observation": {
+          "scientist": {
+            "paper_title": "Planning a proof of the Cauchy-Schwarz inequality",
+            "paper_hypothesis": "A square-expansion argument gives the cleanest proof path.",
+            "paper_method": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+            "paper_key_finding": "The proof is accepted only if every inequality step and equality case is justified.",
+            "experiment_goal": "Produce a proof-planning workflow for the Cauchy-Schwarz inequality for an undergraduate seminar handout.",
+            "conversation_history": [],
+            "current_protocol": null,
+            "round_number": 0,
+            "max_rounds": 6
+          },
+          "lab_manager": {
+            "budget_total": 345.0,
+            "budget_remaining": 345.0,
+            "equipment_available": [
+              "Structured proof notebook"
+            ],
+            "equipment_booked": [],
+            "reagents_in_stock": [
+              "Reference theorem library",
+              "Graduate reviewer"
+            ],
+            "reagents_out_of_stock": [],
+            "staff_count": 1,
+            "time_limit_days": 3,
+            "safety_restrictions": [
+              "The outline should stay concise enough for seminar notes."
+            ],
+            "conversation_history": [],
+            "current_protocol": null,
+            "round_number": 0,
+            "max_rounds": 6
+          }
+        }
+      }
+    },
+    "step": {
+      "client_sends": {
+        "type": "step",
+        "action": {
+          "action_type": "propose_protocol",
+          "sample_size": 30,
+          "controls": [
+            "positive_control",
+            "negative_control"
+          ],
+          "technique": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+          "duration_days": 5,
+          "required_equipment": [
+            "Structured proof notebook"
+          ],
+          "required_reagents": [
+            "Reference theorem library",
+            "Graduate reviewer"
+          ],
+          "questions": [],
+          "rationale": "Initial proposal using available resources."
+        }
+      },
+      "server_responds": {
+        "type": "step_ok",
+        "observation": {
+          "scientist": {
+            "paper_title": "Planning a proof of the Cauchy-Schwarz inequality",
+            "paper_hypothesis": "A square-expansion argument gives the cleanest proof path.",
+            "paper_method": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+            "paper_key_finding": "The proof is accepted only if every inequality step and equality case is justified.",
+            "experiment_goal": "Produce a proof-planning workflow for the Cauchy-Schwarz inequality for an undergraduate seminar handout.",
+            "conversation_history": [
+              {
+                "role": "scientist",
+                "message": "Initial proposal using available resources.",
+                "round_number": 1,
+                "action_type": "propose_protocol"
+              },
+              {
+                "role": "lab_manager",
+                "message": "Budget is within range. Equipment is available.",
+                "round_number": 1,
+                "action_type": "report_feasibility"
+              }
+            ],
+            "current_protocol": {
+              "sample_size": 30,
+              "controls": [
+                "positive_control",
+                "negative_control"
+              ],
+              "technique": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+              "duration_days": 5,
+              "required_equipment": [
+                "Structured proof notebook"
+              ],
+              "required_reagents": [
+                "Reference theorem library",
+                "Graduate reviewer"
+              ],
+              "rationale": "Initial proposal using available resources."
+            },
+            "round_number": 1,
+            "max_rounds": 6
+          },
+          "lab_manager": {
+            "budget_total": 345.0,
+            "budget_remaining": 345.0,
+            "equipment_available": [
+              "Structured proof notebook"
+            ],
+            "equipment_booked": [],
+            "reagents_in_stock": [
+              "Reference theorem library",
+              "Graduate reviewer"
+            ],
+            "reagents_out_of_stock": [],
+            "staff_count": 1,
+            "time_limit_days": 3,
+            "safety_restrictions": [
+              "The outline should stay concise enough for seminar notes."
+            ],
+            "conversation_history": [
+              {
+                "role": "scientist",
+                "message": "Initial proposal using available resources.",
+                "round_number": 1,
+                "action_type": "propose_protocol"
+              },
+              {
+                "role": "lab_manager",
+                "message": "Budget is within range. Equipment is available.",
+                "round_number": 1,
+                "action_type": "report_feasibility"
+              }
+            ],
+            "current_protocol": {
+              "sample_size": 30,
+              "controls": [
+                "positive_control",
+                "negative_control"
+              ],
+              "technique": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+              "duration_days": 5,
+              "required_equipment": [
+                "Structured proof notebook"
+              ],
+              "required_reagents": [
+                "Reference theorem library",
+                "Graduate reviewer"
+              ],
+              "rationale": "Initial proposal using available resources."
+            },
+            "round_number": 1,
+            "max_rounds": 6
+          }
+        },
+        "reward": 0.0,
+        "done": false,
+        "info": {
+          "agreement_reached": false,
+          "error": null,
+          "reward_breakdown": null,
+          "judge_notes": null,
+          "verdict": null
+        }
+      }
+    },
+    "ping": {
+      "client_sends": {
+        "type": "ping"
+      },
+      "server_responds": {
+        "type": "pong"
+      }
+    }
+  }
+}
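Consumers of this fixture can sanity-check the `_meta` block before relying on the examples. A minimal sketch, with the fixture content inlined here rather than read from disk (the on-disk path is `tests/fixtures/api_schema_examples.json`):

```python
import json

# Inline sample mirroring the fixture's _meta block.
fixture_text = """
{
  "_meta": {
    "generated_by": "tests/fixtures/generate_api_examples.py",
    "seed": 42,
    "scenario_template": "math_reasoning",
    "difficulty": "easy"
  }
}
"""
meta = json.loads(fixture_text)["_meta"]
print(meta["scenario_template"])  # math_reasoning
```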
tests/fixtures/generate_api_examples.py ADDED
@@ -0,0 +1,330 @@
+#!/usr/bin/env python3
+"""Generate api_schema_examples.json from real Pydantic models.
+
+MOD 10 — run this script to regenerate the fixture whenever the
+contracts change. The output is deterministic.
+
+Usage:
+    python tests/fixtures/generate_api_examples.py
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from replicalab.config import DEFAULT_DIFFICULTY, DEFAULT_SCENARIO_TEMPLATE
+from replicalab.models import (
+    ConversationEntry,
+    EpisodeLog,
+    EpisodeState,
+    LabManagerObservation,
+    Observation,
+    Protocol,
+    RewardBreakdown,
+    ScientistAction,
+    ScientistObservation,
+    StepInfo,
+    StepResult,
+)
+from replicalab.scenarios import available_scenario_families, generate_scenario
+
+OUTPUT_PATH = Path(__file__).parent / "api_schema_examples.json"
+
+# ---------------------------------------------------------------------------
+# Build realistic payloads from real models
+# ---------------------------------------------------------------------------
+
+_SEED = 42
+_TEMPLATE = DEFAULT_SCENARIO_TEMPLATE
+_DIFFICULTY = DEFAULT_DIFFICULTY
+
+# Generate a real scenario to extract observation data
+_pack = generate_scenario(seed=_SEED, template=_TEMPLATE, difficulty=_DIFFICULTY)
+_sci_obs = _pack.scientist_observation
+_lm_obs = _pack.lab_manager_observation
+
+
+def _reset_request():
+    return {
+        "seed": _SEED,
+        "scenario": _TEMPLATE,
+        "difficulty": _DIFFICULTY,
+        "session_id": None,
+    }
+
+
+def _reset_response():
+    obs = Observation(scientist=_sci_obs, lab_manager=_lm_obs)
+    return {
+        "session_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
+        "episode_id": "ep-deadbeef-1234-5678-9abc-def012345678",
+        "observation": obs.model_dump(),
+    }
+
+
+def _propose_action():
+    return ScientistAction(
+        action_type="propose_protocol",
+        sample_size=30,
+        controls=["positive_control", "negative_control"],
+        technique=_sci_obs.paper_method,
+        duration_days=5,
+        required_equipment=list(_lm_obs.equipment_available[:2]) if _lm_obs.equipment_available else ["tool_a"],
+        required_reagents=list(_lm_obs.reagents_in_stock[:2]) if _lm_obs.reagents_in_stock else ["ref_a"],
+        questions=[],
+        rationale="Initial proposal using available resources.",
+    ).model_dump()
+
+
+def _step_request():
+    return {
+        "session_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
+        "action": _propose_action(),
+    }
+
+
+def _mid_episode_step_result():
+    protocol = Protocol(
+        sample_size=30,
+        controls=["positive_control", "negative_control"],
+        technique=_sci_obs.paper_method,
+        duration_days=5,
+        required_equipment=list(_lm_obs.equipment_available[:2]) if _lm_obs.equipment_available else ["tool_a"],
+        required_reagents=list(_lm_obs.reagents_in_stock[:2]) if _lm_obs.reagents_in_stock else ["ref_a"],
+        rationale="Initial proposal using available resources.",
+    )
+
+    history = [
+        ConversationEntry(
+            role="scientist",
+            message="Initial proposal using available resources.",
+            round_number=1,
+            action_type="propose_protocol",
+        ),
+        ConversationEntry(
+            role="lab_manager",
+            message="Budget is within range. Equipment is available.",
+            round_number=1,
+            action_type="report_feasibility",
+        ),
+    ]
+
+    obs = Observation(
+        scientist=ScientistObservation(
+            paper_title=_sci_obs.paper_title,
+            paper_hypothesis=_sci_obs.paper_hypothesis,
+            paper_method=_sci_obs.paper_method,
+            paper_key_finding=_sci_obs.paper_key_finding,
+            experiment_goal=_sci_obs.experiment_goal,
+            conversation_history=history,
+            current_protocol=protocol,
+            round_number=1,
+            max_rounds=_sci_obs.max_rounds,
+        ),
+        lab_manager=LabManagerObservation(
+            budget_total=_lm_obs.budget_total,
+            budget_remaining=_lm_obs.budget_remaining,
+            equipment_available=list(_lm_obs.equipment_available),
+            equipment_booked=list(_lm_obs.equipment_booked),
+            reagents_in_stock=list(_lm_obs.reagents_in_stock),
+            reagents_out_of_stock=list(_lm_obs.reagents_out_of_stock),
+            staff_count=_lm_obs.staff_count,
+            time_limit_days=_lm_obs.time_limit_days,
+            safety_restrictions=list(_lm_obs.safety_restrictions),
+            conversation_history=history,
+            current_protocol=protocol,
+            round_number=1,
+            max_rounds=_lm_obs.max_rounds,
+        ),
+    )
+
+    return StepResult(
+        observation=obs,
+        reward=0.0,
+        done=False,
+        info=StepInfo(
+            agreement_reached=False,
+            error=None,
+            reward_breakdown=None,
+            judge_notes=None,
+            verdict=None,
+        ),
+    ).model_dump()
+
+
+def _terminal_step_result():
+    return StepResult(
+        observation=None,
+        reward=5.0,
+        done=True,
+        info=StepInfo(
+            agreement_reached=True,
+            error=None,
+            reward_breakdown=RewardBreakdown(
+                rigor=0.8,
+                feasibility=0.8,
+                fidelity=0.8,
+                efficiency_bonus=0.2,
+                communication_bonus=0.1,
+                penalties={"timeout": 0.0},
+            ),
+            judge_notes="Stub audit until judge integration lands.",
+            verdict="accept",
+        ),
+    ).model_dump()
+
+
+def _scenarios_response():
+    return {"scenarios": available_scenario_families()}
+
+
+def _replay_response():
+    return EpisodeLog(
+        episode_id="ep-deadbeef-1234-5678-9abc-def012345678",
+        seed=_SEED,
+        scenario_template=_TEMPLATE,
+        difficulty=_DIFFICULTY,
+        final_state=EpisodeState(
+            seed=_SEED,
+            scenario_template=_TEMPLATE,
+            difficulty=_DIFFICULTY,
+            paper_title=_sci_obs.paper_title,
+            paper_hypothesis=_sci_obs.paper_hypothesis,
+            paper_method=_sci_obs.paper_method,
+            paper_key_finding=_sci_obs.paper_key_finding,
+            experiment_goal=_sci_obs.experiment_goal,
+            lab_budget_total=_lm_obs.budget_total,
+            lab_budget_remaining=_lm_obs.budget_remaining,
+            lab_equipment=list(_lm_obs.equipment_available),
+            lab_reagents=list(_lm_obs.reagents_in_stock),
+            lab_staff_count=_lm_obs.staff_count,
+            lab_time_limit_days=_lm_obs.time_limit_days,
+            round_number=3,
+            max_rounds=_sci_obs.max_rounds,
+            done=True,
+            agreement_reached=True,
+            reward=5.0,
+            rigor_score=0.8,
+            feasibility_score=0.8,
+            fidelity_score=0.8,
+        ),
+        transcript=[
+            ConversationEntry(
+                role="scientist",
+                message="Initial proposal using available resources.",
+                round_number=1,
+                action_type="propose_protocol",
+            ),
+            ConversationEntry(
+                role="lab_manager",
+                message="Budget is within range. Equipment is available.",
+                round_number=1,
+                action_type="report_feasibility",
+            ),
+        ],
+        reward_breakdown=RewardBreakdown(
+            rigor=0.8,
+            feasibility=0.8,
+            fidelity=0.8,
+            efficiency_bonus=0.2,
+            communication_bonus=0.1,
+            penalties={"timeout": 0.0},
+        ),
+        total_reward=5.0,
+        rounds_used=3,
+        agreement_reached=True,
+        judge_notes="Stub audit until judge integration lands.",
+        verdict="accept",
+    ).model_dump()
+
+
+def _ws_reset_message():
+    return {
+        "type": "reset",
+        "seed": _SEED,
+        "scenario": _TEMPLATE,
+        "difficulty": _DIFFICULTY,
+    }
+
+
+def _ws_reset_ok_message():
+    obs = Observation(scientist=_sci_obs, lab_manager=_lm_obs)
+    return {
+        "type": "reset_ok",
+        "episode_id": "ep-deadbeef-1234-5678-9abc-def012345678",
+        "observation": obs.model_dump(),
+    }
+
+
+def _ws_step_message():
+    return {
+        "type": "step",
+        "action": _propose_action(),
+    }
+
+
+def _ws_step_ok_message():
+    return {
+        "type": "step_ok",
+        **_mid_episode_step_result(),
+    }
+
+
+# ---------------------------------------------------------------------------
+# Assemble and write
+# ---------------------------------------------------------------------------
+
+
+def main():
+    examples = {
+        "_meta": {
+            "generated_by": "tests/fixtures/generate_api_examples.py",
+            "description": "API schema examples generated from real Pydantic models. Re-run the script to regenerate after contract changes.",
+            "seed": _SEED,
+            "scenario_template": _TEMPLATE,
+            "difficulty": _DIFFICULTY,
+        },
+        "rest": {
+            "POST /reset": {
+                "request": _reset_request(),
+                "response": _reset_response(),
+            },
+            "POST /step": {
+                "request": _step_request(),
+                "response_mid_episode": _mid_episode_step_result(),
+                "response_terminal": _terminal_step_result(),
+            },
+            "GET /scenarios": {
+                "response": _scenarios_response(),
+            },
+            "GET /replay/{episode_id}": {
+                "response": _replay_response(),
+            },
+        },
+        "websocket": {
+            "reset": {
+                "client_sends": _ws_reset_message(),
+                "server_responds": _ws_reset_ok_message(),
+            },
+            "step": {
+                "client_sends": _ws_step_message(),
+                "server_responds": _ws_step_ok_message(),
+            },
+            "ping": {
+                "client_sends": {"type": "ping"},
+                "server_responds": {"type": "pong"},
+            },
+        },
+    }
+
+    OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
+    OUTPUT_PATH.write_text(
+        json.dumps(examples, indent=2, ensure_ascii=False) + "\n",
+        encoding="utf-8",
+    )
+    print(f"Wrote {OUTPUT_PATH}")
+
+
+if __name__ == "__main__":
+    main()
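The script's claim of deterministic output rests on serializing with a fixed indent and insertion-ordered keys, so re-running the generator on unchanged contracts produces no spurious diffs. A minimal stdlib sketch of that property:

```python
import json

# Python dicts preserve insertion order, and json.dumps with a fixed
# indent is a pure function of its input, so identical inputs yield
# byte-identical fixture text.
a = json.dumps({"_meta": {"seed": 42}}, indent=2, ensure_ascii=False) + "\n"
b = json.dumps({"_meta": {"seed": 42}}, indent=2, ensure_ascii=False) + "\n"
print(a == b)  # True
```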
tests/test_client.py ADDED
@@ -0,0 +1,355 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ """Client module tests — TRN 13.
+
+ Tests cover ReplicaLabClient with both REST and WebSocket transports
+ against the real FastAPI test server.
+ """
+
+ from __future__ import annotations
+
+ import threading
+ import time
+
+ import pytest
+ import uvicorn
+
+ from replicalab.client import ReplicaLabClient
+ from replicalab.models import (
+     Observation,
+     ScientistAction,
+     StepResult,
+ )
+
+
+ # ---------------------------------------------------------------------------
+ # Helpers
+ # ---------------------------------------------------------------------------
+
+ def _propose_action(obs: Observation) -> ScientistAction:
+     """Build a valid propose_protocol action from the observation."""
+     from replicalab.scenarios import generate_scenario
+
+     # Tests reset the server with seed=42, so regenerate the same pack locally.
+     pack = generate_scenario(seed=42, template="math_reasoning", difficulty="easy")
+     lab = pack.lab_manager_observation
+     spec = pack.hidden_reference_spec
+     return ScientistAction(
+         action_type="propose_protocol",
+         sample_size=10,
+         controls=["baseline", "ablation"],
+         technique=spec.summary[:60] if spec.summary else "replication_plan",
+         duration_days=max(1, min(2, lab.time_limit_days)),
+         required_equipment=list(lab.equipment_available[:1]) if lab.equipment_available else [],
+         required_reagents=list(lab.reagents_in_stock[:1]) if lab.reagents_in_stock else [],
+         questions=[],
+         rationale=(
+             f"Plan addresses: {', '.join(spec.required_elements[:2])}. "
+             f"Target metric: {spec.target_metric}. "
+             f"Target value: {spec.target_value}. "
+             "Stay within budget and schedule."
+         ),
+     )
+
+
+ def _accept_action() -> ScientistAction:
+     return ScientistAction(
+         action_type="accept",
+         sample_size=0,
+         controls=[],
+         technique="",
+         duration_days=0,
+         required_equipment=[],
+         required_reagents=[],
+         questions=[],
+         rationale="",
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # Shared live-server fixture
+ # ---------------------------------------------------------------------------
+
+ # We spin up a real uvicorn server on a fixed local port for both transports
+ # to keep things realistic and exercise the actual HTTP/WS paths.
+
+ _TEST_PORT = 18765
+
+
+ @pytest.fixture(scope="module")
+ def live_server():
+     """Start a live uvicorn server for the test module."""
+     from server.app import app
+
+     config = uvicorn.Config(app, host="127.0.0.1", port=_TEST_PORT, log_level="error")
+     server = uvicorn.Server(config)
+     thread = threading.Thread(target=server.run, daemon=True)
+     thread.start()
+
+     # Wait until the server is ready
+     import httpx
+
+     for _ in range(50):
+         try:
+             resp = httpx.get(f"http://127.0.0.1:{_TEST_PORT}/health", timeout=1.0)
+             if resp.status_code == 200:
+                 break
+         except Exception:
+             pass
+         time.sleep(0.1)
+     else:
+         pytest.fail("Live server did not start in time")
+
+     yield f"http://127.0.0.1:{_TEST_PORT}"
+
+     server.should_exit = True
+     thread.join(timeout=5)
+
+
+ # ---------------------------------------------------------------------------
+ # REST transport
+ # ---------------------------------------------------------------------------
+
+
+ class TestRestConnect:
+     """connect() over REST verifies server health."""
+
+     def test_connect_succeeds(self, live_server: str) -> None:
+         client = ReplicaLabClient(live_server, transport="rest")
+         client.connect()
+         assert client.connected
+         client.close()
+
+     def test_connect_bad_url_raises(self) -> None:
+         client = ReplicaLabClient("http://127.0.0.1:19999", transport="rest", timeout=1.0)
+         with pytest.raises(Exception):
+             client.connect()
+
+
+ class TestRestReset:
+     """reset() over REST."""
+
+     def test_reset_returns_observation(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="rest") as client:
+             obs = client.reset(seed=42, scenario="math_reasoning", difficulty="easy")
+             assert isinstance(obs, Observation)
+             assert obs.scientist is not None
+             assert obs.scientist.paper_title
+             assert obs.lab_manager is not None
+             assert obs.lab_manager.budget_total > 0
+
+     def test_reset_sets_session_and_episode_id(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="rest") as client:
+             client.reset(seed=1)
+             assert client.session_id is not None
+             assert client.episode_id is not None
+
+     def test_reset_reuses_session(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="rest") as client:
+             client.reset(seed=1)
+             sid1 = client.session_id
+             ep1 = client.episode_id
+             client.reset(seed=2)
+             assert client.session_id == sid1
+             assert client.episode_id != ep1
+
+
+ class TestRestStep:
+     """step() over REST."""
+
+     def test_step_returns_step_result(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="rest") as client:
+             obs = client.reset(seed=42)
+             action = _propose_action(obs)
+             result = client.step(action)
+             assert isinstance(result, StepResult)
+             assert result.done is False
+             assert result.observation is not None
+
+     def test_step_before_reset_raises(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="rest") as client:
+             with pytest.raises(RuntimeError, match="reset"):
+                 client.step(_accept_action())
+
+     def test_full_episode_propose_accept(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="rest") as client:
+             obs = client.reset(seed=42)
+             action = _propose_action(obs)
+             result1 = client.step(action)
+             assert result1.done is False
+
+             result2 = client.step(_accept_action())
+             assert result2.done is True
+             assert result2.reward > 0.0
+             assert result2.info.agreement_reached is True
+             assert result2.info.verdict == "accept"
+             assert result2.info.reward_breakdown is not None
+             assert 0.0 <= result2.info.reward_breakdown.rigor <= 1.0
+
+
+ class TestRestReplay:
+     """replay() over REST."""
+
+     def test_replay_after_episode(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="rest") as client:
+             obs = client.reset(seed=42)
+             action = _propose_action(obs)
+             client.step(action)
+             client.step(_accept_action())
+
+             episode_id = client.episode_id
+             assert episode_id is not None
+             replay = client.replay(episode_id)
+             assert replay.agreement_reached is True
+             assert replay.total_reward > 0.0
+             assert replay.verdict == "accept"
+
+
+ class TestRestContextManager:
+     """Context manager cleans up on exit."""
+
+     def test_context_manager_closes(self, live_server: str) -> None:
+         client = ReplicaLabClient(live_server, transport="rest")
+         with client:
+             assert client.connected
+             client.reset(seed=1)
+         assert not client.connected
+
+
+ # ---------------------------------------------------------------------------
+ # WebSocket transport
+ # ---------------------------------------------------------------------------
+
+
+ class TestWsConnect:
+     """connect() over WebSocket."""
+
+     def test_connect_succeeds(self, live_server: str) -> None:
+         client = ReplicaLabClient(live_server, transport="websocket")
+         client.connect()
+         assert client.connected
+         client.close()
+
+     def test_connect_bad_url_raises(self) -> None:
+         client = ReplicaLabClient("http://127.0.0.1:19999", transport="websocket", timeout=1.0)
+         with pytest.raises(Exception):
+             client.connect()
+
+
+ class TestWsReset:
+     """reset() over WebSocket."""
+
+     def test_reset_returns_observation(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             obs = client.reset(seed=42, scenario="math_reasoning", difficulty="easy")
+             assert isinstance(obs, Observation)
+             assert obs.scientist is not None
+             assert obs.scientist.paper_title
+             assert obs.lab_manager is not None
+             assert obs.lab_manager.budget_total > 0
+
+     def test_reset_sets_episode_id(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             client.reset(seed=42)
+             assert client.episode_id is not None
+
+     def test_ws_session_id_is_none(self, live_server: str) -> None:
+         """WebSocket transport has no explicit session_id."""
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             client.reset(seed=42)
+             assert client.session_id is None
+
+
+ class TestWsStep:
+     """step() over WebSocket."""
+
+     def test_step_returns_step_result(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             obs = client.reset(seed=42)
+             action = _propose_action(obs)
+             result = client.step(action)
+             assert isinstance(result, StepResult)
+             assert result.done is False
+             assert result.observation is not None
+
+     def test_full_episode_propose_accept(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             obs = client.reset(seed=42)
+             action = _propose_action(obs)
+             result1 = client.step(action)
+             assert result1.done is False
+
+             result2 = client.step(_accept_action())
+             assert result2.done is True
+             assert result2.reward > 0.0
+             assert result2.info.agreement_reached is True
+             assert result2.info.verdict == "accept"
+             assert result2.info.reward_breakdown is not None
+             assert 0.0 <= result2.info.reward_breakdown.rigor <= 1.0
+
+     def test_semantic_invalid_action_step_ok_with_error(self, live_server: str) -> None:
+         """Semantically invalid action → step result with info.error, not a crash."""
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             client.reset(seed=42)
+             bad_action = ScientistAction(
+                 action_type="propose_protocol",
+                 sample_size=5,
+                 controls=["baseline"],
+                 technique="some technique",
+                 duration_days=999,
+                 required_equipment=[],
+                 required_reagents=[],
+                 questions=[],
+                 rationale="Duration is impossibly long.",
+             )
+             result = client.step(bad_action)
+             assert result.done is False
+             assert result.info.error is not None
+             assert "Validation errors" in result.info.error
+
+
+ class TestWsContextManager:
+     """Context manager cleans up on exit."""
+
+     def test_context_manager_closes(self, live_server: str) -> None:
+         client = ReplicaLabClient(live_server, transport="websocket")
+         with client:
+             assert client.connected
+             client.reset(seed=1)
+         assert not client.connected
+
+
+ class TestWsUnsupported:
+     """state() and replay() raise NotImplementedError on WS transport."""
+
+     def test_state_not_supported(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             client.reset(seed=42)
+             with pytest.raises(NotImplementedError):
+                 client.state()
+
+     def test_replay_not_supported(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             with pytest.raises(NotImplementedError):
+                 client.replay("some-id")
+
+
+ # ---------------------------------------------------------------------------
+ # Constructor validation
+ # ---------------------------------------------------------------------------
+
+
+ class TestConstructor:
+     """Transport selection and validation."""
+
+     def test_unknown_transport_raises(self) -> None:
+         with pytest.raises(ValueError, match="Unknown transport"):
+             ReplicaLabClient(transport="grpc")
+
+     def test_not_connected_raises_on_reset(self) -> None:
+         client = ReplicaLabClient(transport="rest")
+         with pytest.raises(RuntimeError, match="not connected"):
+             client.reset(seed=1)
+
+     def test_default_transport_is_websocket(self) -> None:
+         client = ReplicaLabClient()
+         # Check the internal transport type
+         assert type(client._transport).__name__ == "_WsTransport"
tests/test_env.py ADDED
@@ -0,0 +1,635 @@
+ """Tests for ENV 01–08 and JDG 04–05.
+
+ TST 01: reset returns valid observations
+ TST 02: valid step advances round, terminal path returns correct shape
+ TST 03: invalid action returns structured error, env survives
+ """
+
+ from __future__ import annotations
+
+ import pytest
+
+ from replicalab.env import ReplicaLabEnv
+ from replicalab.models import (
+     Protocol,
+     RewardBreakdown,
+     ScientistAction,
+ )
+ from replicalab.scenarios import generate_scenario
+ from replicalab.scoring.rubric import build_reward_breakdown, compute_total_reward
+
+
+ # ---------------------------------------------------------------------------
+ # Helpers
+ # ---------------------------------------------------------------------------
+
+
+ def _scenario(template: str = "math_reasoning", difficulty: str = "easy"):
+     return generate_scenario(seed=42, template=template, difficulty=difficulty)
+
+
+ def _good_action(scenario) -> ScientistAction:
+     """Build a valid propose_protocol action that fits the scenario."""
+     lab = scenario.lab_manager_observation
+     spec = scenario.hidden_reference_spec
+     return ScientistAction(
+         action_type="propose_protocol",
+         sample_size=10,
+         controls=["baseline", "ablation"],
+         technique=spec.summary[:60] if spec.summary else "replication_plan",
+         duration_days=max(1, min(2, lab.time_limit_days)),
+         required_equipment=(
+             list(lab.equipment_available[:1]) if lab.equipment_available else []
+         ),
+         required_reagents=(
+             list(lab.reagents_in_stock[:1]) if lab.reagents_in_stock else []
+         ),
+         questions=[],
+         rationale=(
+             f"Plan addresses: {', '.join(spec.required_elements[:2])}. "
+             f"Target metric: {spec.target_metric}. "
+             f"Target value: {spec.target_value}. "
+             "Stay within budget and schedule."
+         ),
+     )
+
+
+ def _accept_action() -> ScientistAction:
+     """Build a valid accept action."""
+     return ScientistAction(
+         action_type="accept",
+         sample_size=0,
+         controls=[],
+         technique="",
+         duration_days=0,
+         required_equipment=[],
+         required_reagents=[],
+         questions=[],
+         rationale="",
+     )
+
+
+ def _request_info_action() -> ScientistAction:
+     return ScientistAction(
+         action_type="request_info",
+         sample_size=0,
+         controls=[],
+         technique="",
+         duration_days=0,
+         required_equipment=[],
+         required_reagents=[],
+         questions=["What equipment is available?"],
+         rationale="",
+     )
+
+
+ def _good_protocol(scenario) -> Protocol:
+     """Build a well-formed protocol aligned to the scenario."""
+     lab = scenario.lab_manager_observation
+     spec = scenario.hidden_reference_spec
+     return Protocol(
+         sample_size=10,
+         controls=["baseline", "ablation"],
+         technique=spec.summary[:60] if spec.summary else "replication_plan",
+         duration_days=max(1, min(2, lab.time_limit_days)),
+         required_equipment=(
+             list(lab.equipment_available[:1]) if lab.equipment_available else []
+         ),
+         required_reagents=(
+             list(lab.reagents_in_stock[:1]) if lab.reagents_in_stock else []
+         ),
+         rationale=(
+             f"Plan addresses: {', '.join(spec.required_elements[:2])}. "
+             f"Target metric: {spec.target_metric}. "
+             f"Target value: {spec.target_value}. "
+             "Stay within budget and schedule."
+         ),
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # TST 01 — reset returns valid observations
+ # ---------------------------------------------------------------------------
+
+
+ class TestReset:
+     """TST 01: reset() returns a well-formed Observation."""
+
+     def test_reset_returns_observation_with_both_roles(self) -> None:
+         env = ReplicaLabEnv()
+         obs = env.reset(seed=42, scenario="math_reasoning", difficulty="easy")
+
+         assert obs.scientist is not None
+         assert obs.lab_manager is not None
+
+     def test_reset_scientist_fields_populated(self) -> None:
+         env = ReplicaLabEnv()
+         obs = env.reset(seed=42, scenario="ml_benchmark", difficulty="easy")
+
+         s = obs.scientist
+         assert s.paper_title
+         assert s.paper_hypothesis
+         assert s.experiment_goal
+         assert s.round_number == 0
+         assert s.max_rounds > 0
+         assert s.current_protocol is None
+         assert s.conversation_history == []
+
+     def test_reset_lab_manager_fields_populated(self) -> None:
+         env = ReplicaLabEnv()
+         obs = env.reset(seed=42, scenario="finance_trading", difficulty="easy")
+
+         lm = obs.lab_manager
+         assert lm.budget_total > 0
+         assert lm.budget_remaining > 0
+         assert lm.staff_count > 0
+         assert lm.time_limit_days > 0
+         assert lm.round_number == 0
+
+     def test_reset_preserves_booked_and_out_of_stock(self) -> None:
+         """ENV 02: booked/out-of-stock data comes from the scenario pack,
+         not hardcoded empty lists."""
+         env = ReplicaLabEnv()
+         # hard difficulty is more likely to have unavailable resources
+         obs = env.reset(seed=42, scenario="ml_benchmark", difficulty="hard")
+         lm = obs.lab_manager
+
+         # The observation should carry scenario data (it may or may not have
+         # booked items depending on the scenario, but the lists should exist)
+         assert isinstance(lm.equipment_booked, list)
+         assert isinstance(lm.reagents_out_of_stock, list)
+         assert isinstance(lm.safety_restrictions, list)
+         assert len(lm.safety_restrictions) > 0  # always has at least one
+
+     def test_reset_state_round_zero(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=1)
+
+         s = env.state()
+         assert s.round_number == 0
+         assert s.done is False
+         assert s.agreement_reached is False
+
+     def test_reset_generates_episode_id(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=1)
+
+         eid = env.episode_id()
+         assert eid
+         assert len(eid) > 10  # UUID
+
+     def test_reset_clears_previous_episode(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=1, scenario="math_reasoning")
+         first_id = env.episode_id()
+
+         env.reset(seed=2, scenario="ml_benchmark")
+         second_id = env.episode_id()
+
+         assert first_id != second_id
+         assert env.state().round_number == 0
+
+     def test_reset_all_templates_and_difficulties(self) -> None:
+         env = ReplicaLabEnv()
+         for template in ("math_reasoning", "ml_benchmark", "finance_trading"):
+             for difficulty in ("easy", "medium", "hard"):
+                 obs = env.reset(seed=7, scenario=template, difficulty=difficulty)
+                 assert obs.scientist is not None
+                 assert obs.lab_manager is not None
+
+
+ # ---------------------------------------------------------------------------
+ # TST 03 — invalid action returns structured error, env survives
+ # ---------------------------------------------------------------------------
+
+
+ class TestInvalidAction:
+     """TST 03: env returns a structured error for invalid proposals."""
+
+     def test_invalid_duration_returns_error_string(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=42, scenario="math_reasoning", difficulty="easy")
+
+         # duration exceeds the time limit
+         bad_action = ScientistAction(
+             action_type="propose_protocol",
+             sample_size=5,
+             controls=["baseline"],
+             technique="some technique",
+             duration_days=999,
+             required_equipment=[],
+             required_reagents=[],
+             questions=[],
+             rationale="This has way too long a duration for the lab.",
+         )
+         result = env.step(bad_action)
+
+         assert result.done is False
+         assert result.info.error is not None
+         assert "Validation errors" in result.info.error
+
+     def test_env_survives_after_invalid_action(self) -> None:
+         """After returning an error, the env still accepts valid actions."""
+         env = ReplicaLabEnv()
+         scenario = _scenario("math_reasoning", "easy")
+         env.reset(seed=42, scenario="math_reasoning", difficulty="easy")
+
+         # Send an invalid action
+         bad_action = ScientistAction(
+             action_type="propose_protocol",
+             sample_size=5,
+             controls=["baseline"],
+             technique="some technique",
+             duration_days=999,
+             required_equipment=[],
+             required_reagents=[],
+             questions=[],
+             rationale="Way too long a duration for the lab to handle.",
+         )
+         error_result = env.step(bad_action)
+         assert error_result.info.error is not None
+
+         # Now send a valid action — the env should still work
+         good = _good_action(scenario)
+         result = env.step(good)
+         assert result.info.error is None
+         assert result.done is False
+
+     def test_invalid_action_does_not_advance_round(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=42, scenario="math_reasoning", difficulty="easy")
+
+         bad_action = ScientistAction(
+             action_type="propose_protocol",
+             sample_size=5,
+             controls=["baseline"],
+             technique="some technique",
+             duration_days=999,
+             required_equipment=[],
+             required_reagents=[],
+             questions=[],
+             rationale="Duration is impossibly long for this scenario.",
+         )
+         result = env.step(bad_action)
+
+         assert result.info.error is not None
+         assert env.state().round_number == 0
+
+     def test_request_info_always_passes_validation(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=42)
+         result = env.step(_request_info_action())
+
+         assert result.info.error is None
+         assert result.done is False
+
+
+ # ---------------------------------------------------------------------------
+ # TST 02 — valid step advances round, terminal path
+ # ---------------------------------------------------------------------------
+
+
+ class TestStep:
+     """TST 02: step() advances rounds and the terminal path returns the correct shape."""
+
+     def test_step_advances_round_number(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         action = _good_action(scenario)
+         result = env.step(action)
+
+         assert env.state().round_number == 1
+         assert result.done is False
+         assert result.reward == 0.0
+
+     def test_step_returns_observations(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         result = env.step(_good_action(scenario))
+
+         assert result.observation is not None
+         assert result.observation.scientist is not None
+         assert result.observation.lab_manager is not None
+         assert result.observation.scientist.round_number == 1
+
+     def test_step_records_conversation_history(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         env.step(_good_action(scenario))
+
+         s = env.state()
+         # Should have 2 entries: scientist + lab manager
+         assert len(s.conversation_history) == 2
+         assert s.conversation_history[0].role == "scientist"
+         assert s.conversation_history[1].role == "lab_manager"
+
+     def test_accept_with_protocol_terminates(self) -> None:
+         """Scientist accept with an existing protocol → done."""
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         # First propose a protocol
+         env.step(_good_action(scenario))
+
+         # Then accept
+         result = env.step(_accept_action())
+
+         assert result.done is True
+         assert result.info.agreement_reached is True
+
+     def test_accept_terminal_step_has_real_reward(self) -> None:
+         """ENV 06: terminal accept computes real judge scores, not the stub 0.8."""
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         env.step(_good_action(scenario))
+         result = env.step(_accept_action())
+
+         assert result.done is True
+         assert result.reward > 0.0
+         assert result.info.reward_breakdown is not None
+
+         rb = result.info.reward_breakdown
+         assert 0.0 <= rb.rigor <= 1.0
+         assert 0.0 <= rb.feasibility <= 1.0
+         assert 0.0 <= rb.fidelity <= 1.0
+         # Verify it's not the old stub 0.8
+         assert not (rb.rigor == 0.8 and rb.feasibility == 0.8 and rb.fidelity == 0.8)
+
+     def test_max_rounds_terminates(self) -> None:
+         """Reaching max_rounds terminates without agreement."""
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         max_r = env.state().max_rounds
+         for _ in range(max_r):
+             result = env.step(_good_action(scenario))
+
+         assert result.done is True
+         assert result.info.agreement_reached is False
+         assert result.reward == 0.0
+
+     def test_step_info_has_round_and_episode_id(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         result = env.step(_good_action(scenario))
+
+         assert result.info.round == 1
+         assert result.info.episode_id == env.episode_id()
+
+     def test_full_episode_propose_then_accept(self) -> None:
+         """Full 2-step episode: propose → accept."""
+         env = ReplicaLabEnv()
+         scenario = _scenario("ml_benchmark", "easy")
+         env.reset(seed=42, scenario="ml_benchmark", difficulty="easy")
+
+         r1 = env.step(_good_action(scenario))
+         assert not r1.done
+
+         r2 = env.step(_accept_action())
+         assert r2.done
+         assert r2.info.agreement_reached
+         assert r2.reward > 0
+
+
+ # ---------------------------------------------------------------------------
+ # ENV 07 — state() returns deep snapshot
+ # ---------------------------------------------------------------------------
+
+
+ class TestStateSnapshot:
+     """ENV 07: state() returns a deep copy, not a reference."""
+
+     def test_state_is_deep_copy(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=42)
+
+         s1 = env.state()
+         s1.round_number = 999  # mutate the snapshot
+
+         s2 = env.state()
+         assert s2.round_number == 0  # env state unaffected
+
+     def test_state_history_is_independent(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+         env.step(_good_action(scenario))
+
+         s1 = env.state()
+         original_len = len(s1.conversation_history)
+         s1.conversation_history.clear()
+
+         s2 = env.state()
+         assert len(s2.conversation_history) == original_len
+
+
+ # ---------------------------------------------------------------------------
+ # ENV 08 — close() and _ensure_open()
+ # ---------------------------------------------------------------------------
+
+
+ class TestCloseReopen:
+     """ENV 08: close/reopen lifecycle."""
+
+     def test_close_is_idempotent(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=1)
+         env.close()
+         env.close()  # should not raise
+
+     def test_step_after_close_raises(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=1)
+         env.close()
+
+         with pytest.raises(RuntimeError, match="closed"):
+             env.step(_good_action(scenario))
+
+     def test_reset_reopens_closed_env(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=1)
+         env.close()
+
+         # reset should reopen
+         obs = env.reset(seed=2)
+         assert obs.scientist is not None
+
+         # step should work again
+         scenario = _scenario()
+         result = env.step(_good_action(scenario))
+         assert result.info.error is None
+
+
+ # ---------------------------------------------------------------------------
+ # JDG 04–05 — rubric unit tests
+ # ---------------------------------------------------------------------------
+
+
+ class TestRubric:
+     """JDG 04–05: compute_total_reward and build_reward_breakdown."""
+
+     def test_compute_total_reward_formula(self) -> None:
+         """10 × rigor × feasibility × fidelity + bonuses − penalties."""
+         rb = RewardBreakdown(
+             rigor=1.0,
+             feasibility=1.0,
+             fidelity=1.0,
+             efficiency_bonus=0.5,
+             communication_bonus=0.0,
+             penalties={},
+         )
+         total = compute_total_reward(rb)
+         assert total == 10.5  # 10*1*1*1 + 0.5
+
+     def test_compute_total_reward_with_penalties(self) -> None:
+         rb = RewardBreakdown(
+             rigor=0.8,
+             feasibility=0.9,
+             fidelity=0.7,
+             efficiency_bonus=0.0,
+             communication_bonus=0.0,
+             penalties={"timeout": 1.0, "invalid": 0.5},
+         )
+         expected = 10 * 0.8 * 0.9 * 0.7 - 1.5  # 5.04 - 1.5 = 3.54
+         assert abs(compute_total_reward(rb) - expected) < 0.001
+
+     def test_compute_total_reward_zero_scores(self) -> None:
+         rb = RewardBreakdown(rigor=0.0, feasibility=0.5, fidelity=0.5)
+         assert compute_total_reward(rb) == 0.0
+
+     def test_build_reward_breakdown_returns_valid_scores(self) -> None:
+         scenario = _scenario("ml_benchmark", "easy")
+         protocol = _good_protocol(scenario)
+
+         breakdown = build_reward_breakdown(
+             protocol=protocol,
+             scenario=scenario,
+             rounds_used=1,
+             max_rounds=6,
+         )
+
+         assert 0.0 <= breakdown.rigor <= 1.0
+         assert 0.0 <= breakdown.feasibility <= 1.0
+         assert 0.0 <= breakdown.fidelity <= 1.0
+         assert breakdown.efficiency_bonus >= 0.0
+
+     def test_build_reward_breakdown_efficiency_bonus(self) -> None:
+         """Finishing in fewer rounds gives a higher bonus."""
+         scenario = _scenario()
+         protocol = _good_protocol(scenario)
+
+         fast = build_reward_breakdown(protocol, scenario, rounds_used=1, max_rounds=6)
+         slow = build_reward_breakdown(protocol, scenario, rounds_used=5, max_rounds=6)
+
+         assert fast.efficiency_bonus > slow.efficiency_bonus
+
+     def test_build_reward_breakdown_is_deterministic(self) -> None:
+         scenario = _scenario("finance_trading", "medium")
+         protocol = _good_protocol(scenario)
+
+         b1 = build_reward_breakdown(protocol, scenario, rounds_used=2, max_rounds=6)
+         b2 = build_reward_breakdown(protocol, scenario, rounds_used=2, max_rounds=6)
+
+         assert b1.rigor == b2.rigor
+         assert b1.feasibility == b2.feasibility
+         assert b1.fidelity == b2.fidelity
+         assert b1.efficiency_bonus == b2.efficiency_bonus
+
+     def test_total_reward_matches_manual_calculation(self) -> None:
+         scenario = _scenario("math_reasoning", "easy")
+         protocol = _good_protocol(scenario)
+
+         breakdown = build_reward_breakdown(protocol, scenario, rounds_used=2, max_rounds=6)
+         total = compute_total_reward(breakdown)
+         expected = (
+             10.0 * breakdown.rigor * breakdown.feasibility * breakdown.fidelity
+             + breakdown.efficiency_bonus
+             + breakdown.communication_bonus
+             - sum(breakdown.penalties.values())
+         )
+         assert abs(total - expected) < 0.0001
+
+
+ # ---------------------------------------------------------------------------
+ # ENV 06 — terminal reward wiring
+ # ---------------------------------------------------------------------------
+
+
+ class TestEnvReward:
+     """ENV 06: real judge scoring at terminal steps."""
+
+     def test_agreement_terminal_has_breakdown_notes_verdict(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         env.step(_good_action(scenario))
+         result = env.step(_accept_action())
+
+         assert result.done
+         assert result.info.reward_breakdown is not None
+         assert result.info.judge_notes is not None
+         assert result.info.verdict == "accept"
+         assert "rigor" in result.info.judge_notes
+
+     def test_no_agreement_terminal_is_deterministic(self) -> None:
+         def run_timeout_episode():
+             env = ReplicaLabEnv()
+             scenario = _scenario()
+             env.reset(seed=42)
+             max_r = env.state().max_rounds
+             result = None
+             for _ in range(max_r):
+                 result = env.step(_good_action(scenario))
+             return result
+
+         r1 = run_timeout_episode()
+         r2 = run_timeout_episode()
+
+         assert r1.reward == r2.reward
+         assert r1.info.verdict == r2.info.verdict
+
+     def test_timeout_verdict(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         max_r = env.state().max_rounds
+         result = None
+         for _ in range(max_r):
+             result = env.step(_good_action(scenario))
+
+         assert result.done
+         assert result.info.verdict == "timeout"
+         assert result.info.reward_breakdown is not None
619
+ assert result.reward == 0.0
620
+
621
+ def test_episode_state_stores_final_scores(self) -> None:
622
+ env = ReplicaLabEnv()
623
+ scenario = _scenario()
624
+ env.reset(seed=42)
625
+
626
+ env.step(_good_action(scenario))
627
+ env.step(_accept_action())
628
+
629
+ s = env.state()
630
+ assert s.done
631
+ assert s.agreement_reached
632
+ assert s.rigor_score > 0.0
633
+ assert s.feasibility_score > 0.0
634
+ assert s.fidelity_score > 0.0
635
+ assert s.reward > 0.0
tests/test_reward.py CHANGED
@@ -3,9 +3,15 @@
 from __future__ import annotations
 
 from replicalab.agents.lab_manager_policy import check_feasibility
-from replicalab.models import Protocol
+from replicalab.models import Protocol, RewardBreakdown
 from replicalab.scenarios import generate_scenario
-from replicalab.scoring import score_feasibility, score_fidelity, score_rigor
+from replicalab.scoring import (
+    build_reward_breakdown,
+    compute_total_reward,
+    score_feasibility,
+    score_fidelity,
+    score_rigor,
+)
 
 
 # ---------------------------------------------------------------------------
@@ -301,3 +307,100 @@ def test_good_protocol_dominates_bad_on_rigor_and_fidelity() -> None:
 
     assert score_rigor(good, scenario) > score_rigor(bad, scenario)
     assert score_fidelity(good, scenario) > score_fidelity(bad, scenario)
+
+
+# ---------------------------------------------------------------------------
+# JDG 04 — compute_total_reward
+# ---------------------------------------------------------------------------
+
+
+def test_total_reward_perfect_beats_broken() -> None:
+    """A well-aligned protocol earns a higher total reward than a bad one."""
+    scenario = _scenario("ml_benchmark", "easy")
+    good = _good_protocol(scenario)
+    bad = _bad_protocol()
+
+    good_bd = build_reward_breakdown(good, scenario, rounds_used=1, max_rounds=6)
+    bad_bd = build_reward_breakdown(bad, scenario, rounds_used=1, max_rounds=6)
+
+    assert compute_total_reward(good_bd) > compute_total_reward(bad_bd)
+
+
+def test_zero_feasibility_zeroes_base() -> None:
+    """If any component is 0, the multiplicative base is 0."""
+    rb = RewardBreakdown(rigor=1.0, feasibility=0.0, fidelity=1.0)
+    assert compute_total_reward(rb) == 0.0
+
+
+def test_efficiency_bonus_higher_when_faster() -> None:
+    """Finishing in fewer rounds yields a higher total reward."""
+    scenario = _scenario()
+    protocol = _good_protocol(scenario)
+
+    fast = build_reward_breakdown(protocol, scenario, rounds_used=1, max_rounds=6)
+    slow = build_reward_breakdown(protocol, scenario, rounds_used=5, max_rounds=6)
+
+    assert compute_total_reward(fast) > compute_total_reward(slow)
+
+
+def test_penalty_subtraction_exact() -> None:
+    """Named penalties subtract exactly from the total."""
+    rb = RewardBreakdown(
+        rigor=1.0,
+        feasibility=1.0,
+        fidelity=1.0,
+        penalties={"invalid_tool_use": 2.0, "unsupported_claim": 0.5},
+    )
+    total = compute_total_reward(rb)
+    assert total == 7.5  # 10*1*1*1 - 2.5
+
+
+def test_total_reward_clamps_at_zero() -> None:
+    """Massive penalties cannot push the total below 0."""
+    rb = RewardBreakdown(
+        rigor=0.1,
+        feasibility=0.1,
+        fidelity=0.1,
+        penalties={"massive_penalty": 50.0},
+    )
+    assert compute_total_reward(rb) == 0.0
+
+
+def test_breakdown_determinism() -> None:
+    """Same inputs always produce the same total reward."""
+    scenario = _scenario("finance_trading", "medium")
+    protocol = _good_protocol(scenario)
+
+    b1 = build_reward_breakdown(protocol, scenario, rounds_used=3, max_rounds=6)
+    b2 = build_reward_breakdown(protocol, scenario, rounds_used=3, max_rounds=6)
+
+    assert compute_total_reward(b1) == compute_total_reward(b2)
+
+
+# ---------------------------------------------------------------------------
+# JDG 05 — build_reward_breakdown
+# ---------------------------------------------------------------------------
+
+
+def test_breakdown_accepts_external_penalties() -> None:
+    """Callers can inject named penalty keys via the penalties parameter."""
+    scenario = _scenario()
+    protocol = _good_protocol(scenario)
+
+    bd = build_reward_breakdown(
+        protocol, scenario, rounds_used=2, max_rounds=6,
+        penalties={"invalid_tool_use": 1.0},
+    )
+
+    assert "invalid_tool_use" in bd.penalties
+    assert bd.penalties["invalid_tool_use"] == 1.0
+
+
+def test_breakdown_no_penalties_by_default() -> None:
+    """Without external penalties, the dict is empty."""
+    scenario = _scenario()
+    protocol = _good_protocol(scenario)
+
+    bd = build_reward_breakdown(protocol, scenario, rounds_used=2, max_rounds=6)
+
+    assert bd.penalties == {}
tests/test_scientist_policy.py CHANGED
@@ -556,3 +556,347 @@ def test_baseline_scientist_finishes_stub_episode_without_crashing() -> None:
 
     assert second_step.done is True
     assert second_step.info.agreement_reached is True
+
+
+# ---------------------------------------------------------------------------
+# AGT 08 — Extended prompt, parser, formatter, and baseline coverage
+# ---------------------------------------------------------------------------
+
+
+# --- Parser happy paths ---
+
+
+def test_parse_scientist_output_accepts_propose_protocol() -> None:
+    raw_text = """{
+        "action_type": "propose_protocol",
+        "sample_size": 48,
+        "controls": ["vehicle_control", "positive_control"],
+        "technique": "wst1_assay",
+        "duration_days": 5,
+        "required_equipment": ["plate_reader"],
+        "required_reagents": ["wst1", "dmso"],
+        "questions": [],
+        "rationale": "Standard viability assay with two controls."
+    }"""
+
+    action = parse_scientist_output(raw_text)
+
+    assert action.action_type is ScientistActionType.PROPOSE_PROTOCOL
+    assert action.sample_size == 48
+    assert action.technique == "wst1_assay"
+    assert action.controls == ["vehicle_control", "positive_control"]
+    assert action.questions == []
+
+
+def test_parse_scientist_output_accepts_accept_action() -> None:
+    raw_text = """{
+        "action_type": "accept",
+        "sample_size": 0,
+        "controls": [],
+        "technique": "",
+        "duration_days": 0,
+        "required_equipment": [],
+        "required_reagents": [],
+        "questions": [],
+        "rationale": ""
+    }"""
+
+    action = parse_scientist_output(raw_text)
+
+    assert action.action_type is ScientistActionType.ACCEPT
+    assert action.sample_size == 0
+    assert action.rationale == ""
+
+
+def test_parse_scientist_output_accepts_prose_wrapped_json() -> None:
+    raw_text = (
+        "After reviewing the constraints I think a request is in order.\n\n"
+        '{"action_type": "request_info", "sample_size": 0, '
+        '"controls": [], "technique": "", "duration_days": 0, '
+        '"required_equipment": [], "required_reagents": [], '
+        '"questions": ["Is the GPU available?"], "rationale": ""}\n\n'
+        "That should clarify the compute situation."
+    )
+
+    action = parse_scientist_output(raw_text)
+
+    assert action.action_type is ScientistActionType.REQUEST_INFO
+    assert action.questions == ["Is the GPU available?"]
+
+
+# --- Parser edge cases ---
+
+
+def test_parse_scientist_output_raises_on_empty_string() -> None:
+    with pytest.raises(ScientistOutputParseError) as exc_info:
+        parse_scientist_output("")
+
+    assert exc_info.value.code == "no_json"
+
+
+def test_parse_scientist_output_raises_on_whitespace_only() -> None:
+    with pytest.raises(ScientistOutputParseError) as exc_info:
+        parse_scientist_output(" \n\t ")
+
+    assert exc_info.value.code == "no_json"
+
+
+def test_parse_scientist_output_raises_on_json_list() -> None:
+    # The parser's brace extractor finds the inner object from the list,
+    # so this surfaces as an invalid_action (missing required fields)
+    # rather than an invalid_json error.
+    with pytest.raises(ScientistOutputParseError) as exc_info:
+        parse_scientist_output('[{"action_type": "accept"}]')
+
+    assert exc_info.value.code == "invalid_action"
+
+
+def test_parse_scientist_output_raises_on_extra_forbidden_keys() -> None:
+    raw_text = """{
+        "action_type": "accept",
+        "sample_size": 0,
+        "controls": [],
+        "technique": "",
+        "duration_days": 0,
+        "required_equipment": [],
+        "required_reagents": [],
+        "questions": [],
+        "rationale": "",
+        "secret_field": "should not be here"
+    }"""
+
+    with pytest.raises(ScientistOutputParseError) as exc_info:
+        parse_scientist_output(raw_text)
+
+    assert exc_info.value.code == "invalid_action"
+    assert exc_info.value.parsed_payload is not None
+    assert "secret_field" in exc_info.value.parsed_payload
+
+
+def test_parse_error_to_dict_serialization() -> None:
+    try:
+        parse_scientist_output("no json here")
+    except ScientistOutputParseError as exc:
+        result = exc.to_dict()
+        assert result["code"] == "no_json"
+        assert result["raw_text"] == "no json here"
+        assert result["parsed_payload"] is None
+        assert "message" in result
+    else:
+        pytest.fail("Expected ScientistOutputParseError")
+
+
+def test_parse_error_to_dict_with_parsed_payload() -> None:
+    raw_text = """{
+        "action_type": "request_info",
+        "sample_size": 0,
+        "controls": [],
+        "technique": "",
+        "duration_days": 0,
+        "required_equipment": [],
+        "required_reagents": [],
+        "questions": [],
+        "rationale": ""
+    }"""
+    try:
+        parse_scientist_output(raw_text)
+    except ScientistOutputParseError as exc:
+        result = exc.to_dict()
+        assert result["code"] == "invalid_action"
+        assert result["parsed_payload"] is not None
+        assert result["parsed_payload"]["action_type"] == "request_info"
+    else:
+        pytest.fail("Expected ScientistOutputParseError")
+
+
+# --- System prompt: domain coverage ---
+
+
+def test_system_prompt_math_domain() -> None:
+    scenario = generate_scenario(seed=10, template="math_reasoning", difficulty="easy")
+    prompt = build_scientist_system_prompt(scenario)
+
+    assert "Domain: mathematics" in prompt
+    assert scenario.task_summary in prompt
+    assert "You are the Scientist agent" in prompt
+
+
+def test_system_prompt_finance_domain() -> None:
+    scenario = generate_scenario(seed=10, template="finance_trading", difficulty="easy")
+    prompt = build_scientist_system_prompt(scenario)
+
+    assert "Domain: finance_trading" in prompt
+    assert scenario.task_summary in prompt
+
+
+def test_system_prompt_ml_domain() -> None:
+    scenario = generate_scenario(seed=10, template="ml_benchmark", difficulty="easy")
+    prompt = build_scientist_system_prompt(scenario)
+
+    assert "Domain: machine_learning" in prompt
+    assert scenario.task_summary in prompt
+
+
+def test_system_prompt_accepts_dict_input() -> None:
+    scenario = generate_scenario(seed=5, template="math_reasoning", difficulty="easy")
+    pack_dict = scenario.model_dump()
+
+    prompt = build_scientist_system_prompt(pack_dict)
+
+    assert "You are the Scientist agent" in prompt
+    assert scenario.task_summary in prompt
+    assert "Domain: mathematics" in prompt
+
+
+# --- System prompt: bounded-tool policy assertions ---
+
+
+def test_system_prompt_contains_bounded_tool_policy() -> None:
+    scenario = generate_scenario(seed=1, template="math_reasoning", difficulty="easy")
+    prompt = build_scientist_system_prompt(scenario)
+
+    assert "search_evidence" in prompt
+    assert "run_code_check" in prompt
+    assert "inspect_image" in prompt
+
+
+def test_system_prompt_bounded_tool_policy_rules() -> None:
+    scenario = generate_scenario(seed=1, template="ml_benchmark", difficulty="medium")
+    prompt = build_scientist_system_prompt(scenario)
+
+    assert "No unrestricted web browsing" in prompt
+    assert "No audio" in prompt
+    assert "do not override constraints" in prompt or "Tools do not override constraints" in prompt
+
+
+def test_system_prompt_bounded_tool_policy_present_in_all_domains() -> None:
+    for template in ("math_reasoning", "ml_benchmark", "finance_trading"):
+        scenario = generate_scenario(seed=42, template=template, difficulty="easy")
+        prompt = build_scientist_system_prompt(scenario)
+
+        assert "Bounded tool policy" in prompt, f"Missing in {template}"
+        assert "search_evidence" in prompt, f"Missing search_evidence in {template}"
+        assert "run_code_check" in prompt, f"Missing run_code_check in {template}"
+        assert "inspect_image" in prompt, f"Missing inspect_image in {template}"
+
+
+# --- System prompt: role-boundary assertions ---
+
+
+def test_system_prompt_contains_role_boundaries() -> None:
+    scenario = generate_scenario(seed=1, template="math_reasoning", difficulty="easy")
+    prompt = build_scientist_system_prompt(scenario)
+
+    assert "do not invent resources" in prompt
+    assert "do not assume access to hidden ground truth" in prompt.lower() or \
+        "hidden ground truth" in prompt
+
+
+def test_system_prompt_contains_output_contract() -> None:
+    scenario = generate_scenario(seed=1, template="math_reasoning", difficulty="easy")
+    prompt = build_scientist_system_prompt(scenario)
+
+    assert "Output contract" in prompt
+    assert "exactly one JSON object" in prompt
+    assert "no extra keys" in prompt
+
+
+# --- Observation formatter edge cases ---
+
+
+def test_format_observation_final_round() -> None:
+    obs = _base_observation(round_number=5, max_rounds=6)
+    result = format_scientist_observation(obs)
+
+    assert "Round 5 of 6" in result
+    assert "Respond with exactly one JSON" in result
+
+
+def test_format_observation_protocol_with_empty_lists() -> None:
+    protocol = Protocol(
+        sample_size=1,
+        controls=[],
+        technique="minimal_check",
+        duration_days=1,
+        required_equipment=[],
+        required_reagents=[],
+        rationale="Minimal protocol.",
+    )
+    obs = _base_observation(current_protocol=protocol, round_number=1)
+    result = format_scientist_observation(obs)
+
+    assert "Current protocol:" in result
+    assert "technique: minimal_check" in result
+    assert "controls: (none)" in result
+    assert "required_equipment: (none)" in result
+    assert "required_reagents: (none)" in result
+
+
+# --- Baseline: domain inference ---
+
+
+def test_baseline_scientist_infers_ml_domain() -> None:
+    obs = _base_observation(
+        paper_title="Reproducing CIFAR-10 accuracy with ResNet",
+        paper_method="Train on CIFAR dataset with GPU",
+        experiment_goal="Match the published benchmark accuracy.",
+    )
+    action = build_baseline_scientist_action(obs)
+
+    assert action.action_type is ScientistActionType.PROPOSE_PROTOCOL
+    assert action.technique == "published_split_replication"
+
+
+def test_baseline_scientist_infers_finance_domain() -> None:
+    obs = _base_observation(
+        paper_title="Offline backtest of SPY mean-reversion",
+        paper_method="Daily bar backtest with slippage modeling",
+        experiment_goal="Evaluate Sharpe ratio under drawdown limits.",
+    )
+    action = build_baseline_scientist_action(obs)
+
+    assert action.action_type is ScientistActionType.PROPOSE_PROTOCOL
+    assert action.technique == "offline_backtest_workflow"
+
+
+def test_baseline_scientist_infers_math_domain() -> None:
+    obs = _base_observation(
+        paper_title="Planning a proof of AM-GM inequality",
+        paper_method="Algebraic manipulation with induction.",
+        experiment_goal="Verify the proof outline.",
+    )
+    action = build_baseline_scientist_action(obs)
+
+    assert action.action_type is ScientistActionType.PROPOSE_PROTOCOL
+    assert action.technique == "structured_proof_outline"
+
+
+# --- Baseline: forced accept at final round ---
+
+
+def test_baseline_scientist_accepts_at_final_round_even_with_blocker() -> None:
+    obs = _base_observation(
+        current_protocol=Protocol(
+            sample_size=20,
+            controls=["ctrl"],
+            technique="method_a",
+            duration_days=5,
+            required_equipment=[],
+            required_reagents=[],
+            rationale="Full scope plan.",
+        ),
+        conversation_history=[
+            ConversationEntry(
+                role="lab_manager",
+                message="Budget is tight and equipment is booked.",
+                round_number=4,
+                action_type="suggest_alternative",
+            ),
+        ],
+        round_number=5,
+        max_rounds=6,
+    )
+
+    action = build_baseline_scientist_action(obs)
+
+    assert action.action_type is ScientistActionType.ACCEPT
tests/test_server.py ADDED
@@ -0,0 +1,604 @@
1
+ """Server endpoint tests.
2
+
3
+ API 02 adds POST /reset endpoint tests.
4
+ API 04 adds a smoke test for GET /scenarios.
5
+ API 13 adds CORS middleware verification tests.
6
+ API 03 adds POST /step endpoint tests.
7
+ API 06 adds WebSocket session handler tests.
8
+ API 07 adds idle-timeout and graceful disconnect cleanup tests.
9
+ """
10
+
11
+ from __future__ import annotations
12
+
13
+ import json
14
+ import time
15
+ from unittest.mock import patch
16
+
17
+ import pytest
18
+ from fastapi.testclient import TestClient
19
+ from starlette.websockets import WebSocketDisconnect
20
+
21
+ from server.app import app
22
+
23
+ _EXPECTED_FAMILIES = {"math_reasoning", "ml_benchmark", "finance_trading"}
24
+ _EXPECTED_DIFFICULTIES = ["easy", "medium", "hard"]
25
+
26
+
27
+ @pytest.fixture()
28
+ def client():
29
+ return TestClient(app)
30
+
31
+
32
+ class TestScenariosEndpoint:
33
+ """GET /scenarios — API 04."""
34
+
35
+ def test_returns_200(self, client: TestClient):
36
+ resp = client.get("/scenarios")
37
+ assert resp.status_code == 200
38
+
39
+ def test_response_has_scenarios_key(self, client: TestClient):
40
+ data = client.get("/scenarios").json()
41
+ assert "scenarios" in data
42
+ assert isinstance(data["scenarios"], list)
43
+
44
+ def test_all_families_present(self, client: TestClient):
45
+ data = client.get("/scenarios").json()
46
+ families = {s["family"] for s in data["scenarios"]}
47
+ assert families == _EXPECTED_FAMILIES
48
+
49
+ def test_each_family_has_difficulties(self, client: TestClient):
50
+ data = client.get("/scenarios").json()
51
+ for entry in data["scenarios"]:
52
+ assert entry["difficulties"] == _EXPECTED_DIFFICULTIES
53
+
54
+ def test_no_extra_keys(self, client: TestClient):
55
+ data = client.get("/scenarios").json()
56
+ for entry in data["scenarios"]:
57
+ assert set(entry.keys()) == {"family", "difficulties"}
58
+
59
+
60
+ # ---------------------------------------------------------------------------
61
+ # POST /reset — API 02
62
+ # ---------------------------------------------------------------------------
63
+
64
+
65
+ class TestCorsConfiguration:
66
+ """API 13: CORS middleware for local frontend and HF Spaces."""
67
+
68
+ def test_preflight_allows_localhost_vite_origin(self, client: TestClient) -> None:
69
+ resp = client.options(
70
+ "/reset",
71
+ headers={
72
+ "Origin": "http://localhost:5173",
73
+ "Access-Control-Request-Method": "POST",
74
+ },
75
+ )
76
+
77
+ assert resp.status_code == 200
78
+ assert resp.headers["access-control-allow-origin"] == "http://localhost:5173"
79
+ assert resp.headers["access-control-allow-credentials"] == "true"
80
+
81
+ def test_preflight_allows_hf_space_origin(self, client: TestClient) -> None:
82
+ origin = "https://replicalab-demo.hf.space"
83
+ resp = client.options(
84
+ "/health",
85
+ headers={
86
+ "Origin": origin,
87
+ "Access-Control-Request-Method": "GET",
88
+ },
89
+ )
90
+
91
+ assert resp.status_code == 200
92
+ assert resp.headers["access-control-allow-origin"] == origin
93
+ assert resp.headers["access-control-allow-credentials"] == "true"
94
+
95
+ def test_preflight_rejects_unconfigured_origin(self, client: TestClient) -> None:
96
+ resp = client.options(
97
+ "/reset",
98
+ headers={
99
+ "Origin": "https://evil.example.com",
100
+ "Access-Control-Request-Method": "POST",
101
+ },
102
+ )
103
+
104
+ assert resp.status_code == 400
105
+ assert "access-control-allow-origin" not in resp.headers
106
+
107
+
108
+ class TestResetEndpoint:
109
+ """POST /reset — API 02."""
110
+
111
+ def test_reset_returns_200_with_expected_keys(self, client: TestClient) -> None:
112
+ resp = client.post("/reset", json={"seed": 1})
113
+ assert resp.status_code == 200
114
+ data = resp.json()
115
+ assert "session_id" in data
116
+ assert "episode_id" in data
117
+ assert "observation" in data
118
+
119
+ def test_reset_observation_has_both_roles(self, client: TestClient) -> None:
120
+ data = client.post("/reset", json={"seed": 1}).json()
121
+ obs = data["observation"]
122
+ assert "scientist" in obs
123
+ assert "lab_manager" in obs
124
+ assert obs["scientist"]["paper_title"]
125
+ assert obs["lab_manager"]["budget_total"] > 0
126
+
127
+ def test_reset_with_explicit_session_id_reuses_slot(
128
+ self, client: TestClient
129
+ ) -> None:
130
+ """Passing session_id reuses the same slot and returns the same id."""
131
+ sid = "my-fixed-session"
132
+ d1 = client.post("/reset", json={"seed": 1, "session_id": sid}).json()
133
+ assert d1["session_id"] == sid
134
+
135
+ d2 = client.post("/reset", json={"seed": 2, "session_id": sid}).json()
136
+ assert d2["session_id"] == sid
137
+ # New episode each time
138
+ assert d2["episode_id"] != d1["episode_id"]
139
+
140
+ def test_reset_reuse_closes_prior_env(self, client: TestClient) -> None:
141
+ """Resetting with the same session_id produces a fresh episode."""
142
+ sid = "reuse-session"
143
+ d1 = client.post("/reset", json={"seed": 10, "session_id": sid}).json()
144
+ ep1 = d1["episode_id"]
145
+
146
+ d2 = client.post("/reset", json={"seed": 20, "session_id": sid}).json()
147
+ ep2 = d2["episode_id"]
148
+
149
+ assert ep1 != ep2
150
+
151
+ def test_reset_default_params(self, client: TestClient) -> None:
152
+ """Omitting scenario and difficulty uses defaults without error."""
153
+ resp = client.post("/reset", json={"seed": 0})
154
+ assert resp.status_code == 200
155
+ data = resp.json()
156
+ assert data["observation"]["scientist"]["paper_title"]
157
+
158
+ def test_reset_custom_scenario_and_difficulty(self, client: TestClient) -> None:
159
+ for family in ("math_reasoning", "ml_benchmark", "finance_trading"):
160
+ for diff in ("easy", "medium", "hard"):
161
+ resp = client.post(
162
+ "/reset",
163
+ json={"seed": 42, "scenario": family, "difficulty": diff},
164
+ )
165
+ assert resp.status_code == 200, f"Failed for {family}/{diff}"
166
+ obs = resp.json()["observation"]
167
+ assert obs["scientist"]["paper_title"]
168
+ assert obs["lab_manager"]["budget_total"] > 0
169
+
170
+ def test_reset_deterministic_with_same_seed(self, client: TestClient) -> None:
171
+ """Same seed + scenario + difficulty → identical observations."""
172
+ params = {"seed": 99, "scenario": "math_reasoning", "difficulty": "medium"}
173
+ d1 = client.post("/reset", json=params).json()
174
+ d2 = client.post("/reset", json=params).json()
175
+
176
+ assert d1["observation"] == d2["observation"]
177
+ # Episode ids differ (new UUID each time)
178
+ assert d1["episode_id"] != d2["episode_id"]
179
+
180
+
181
+ # ---------------------------------------------------------------------------
182
+ # Helpers
183
+ # ---------------------------------------------------------------------------
184
+
185
+
186
+ def _reset(client: TestClient, **kwargs) -> dict:
187
+ """Reset and return the response JSON."""
188
+ payload = {"seed": 42, "scenario": "math_reasoning", "difficulty": "easy"}
189
+ payload.update(kwargs)
190
+ resp = client.post("/reset", json=payload)
191
+ assert resp.status_code == 200
192
+ return resp.json()
193
+
194
+
195
+ def _good_action_payload(client: TestClient) -> dict:
196
+ """Build a valid propose_protocol action payload from a fresh scenario."""
197
+ from replicalab.scenarios import generate_scenario
198
+
199
+ scenario = generate_scenario(seed=42, template="math_reasoning", difficulty="easy")
200
+ lab = scenario.lab_manager_observation
201
+ spec = scenario.hidden_reference_spec
202
+ return {
203
+ "action_type": "propose_protocol",
204
+ "sample_size": 10,
205
+ "controls": ["baseline", "ablation"],
206
+ "technique": spec.summary[:60] if spec.summary else "replication_plan",
207
+ "duration_days": max(1, min(2, lab.time_limit_days)),
208
+ "required_equipment": (
209
+ list(lab.equipment_available[:1]) if lab.equipment_available else []
210
+ ),
211
+ "required_reagents": (
212
+ list(lab.reagents_in_stock[:1]) if lab.reagents_in_stock else []
213
+ ),
214
+ "questions": [],
215
+ "rationale": (
216
+ f"Plan addresses: {', '.join(spec.required_elements[:2])}. "
217
+ f"Target metric: {spec.target_metric}. "
218
+ f"Target value: {spec.target_value}. "
219
+ "Stay within budget and schedule."
220
+ ),
221
+ }
222
+
223
+
224
+ def _accept_action_payload() -> dict:
225
+ return {
226
+ "action_type": "accept",
227
+ "sample_size": 0,
228
+ "controls": [],
229
+ "technique": "",
230
+ "duration_days": 0,
231
+ "required_equipment": [],
232
+ "required_reagents": [],
233
+ "questions": [],
234
+ "rationale": "",
235
+ }
236
+
237
+
238
+ # ---------------------------------------------------------------------------
239
+ # POST /step — API 03
240
+ # ---------------------------------------------------------------------------
241
+
242
+
243
+ class TestStepEndpoint:
+     """POST /step — API 03."""
+
+     def test_reset_then_step_happy_path(self, client: TestClient) -> None:
+         """Reset, then step with a valid action → 200 with StepResult."""
+         reset_data = _reset(client)
+         session_id = reset_data["session_id"]
+
+         action = _good_action_payload(client)
+         resp = client.post("/step", json={"session_id": session_id, "action": action})
+
+         assert resp.status_code == 200
+         data = resp.json()
+         assert "observation" in data
+         assert "reward" in data
+         assert "done" in data
+         assert "info" in data
+         assert data["done"] is False
+         assert data["info"]["error"] is None
+
+     def test_step_invalid_session_returns_404(self, client: TestClient) -> None:
+         """Step with a non-existent session_id → 404."""
+         action = _good_action_payload(client)
+         resp = client.post(
+             "/step",
+             json={"session_id": "nonexistent-session-id", "action": action},
+         )
+
+         assert resp.status_code == 404
+         assert "Session not found" in resp.json()["detail"]
+
+     def test_terminal_step_returns_real_reward_breakdown(
+         self, client: TestClient
+     ) -> None:
+         """Propose → accept: terminal step has real reward_breakdown,
+         judge_notes, and verdict from the env (not stubs)."""
+         reset_data = _reset(client)
+         session_id = reset_data["session_id"]
+
+         # Step 1: propose
+         action = _good_action_payload(client)
+         resp1 = client.post("/step", json={"session_id": session_id, "action": action})
+         assert resp1.status_code == 200
+         assert resp1.json()["done"] is False
+
+         # Step 2: accept
+         resp2 = client.post(
+             "/step",
+             json={"session_id": session_id, "action": _accept_action_payload()},
+         )
+         assert resp2.status_code == 200
+         data = resp2.json()
+
+         assert data["done"] is True
+         assert data["reward"] > 0.0
+
+         info = data["info"]
+         assert info["agreement_reached"] is True
+         assert info["verdict"] == "accept"
+         assert info["judge_notes"] is not None
+         assert "rigor" in info["judge_notes"]
+
+         rb = info["reward_breakdown"]
+         assert rb is not None
+         assert 0.0 <= rb["rigor"] <= 1.0
+         assert 0.0 <= rb["feasibility"] <= 1.0
+         assert 0.0 <= rb["fidelity"] <= 1.0
+         # Verify it's not the old stub 0.8
+         assert not (rb["rigor"] == 0.8 and rb["feasibility"] == 0.8 and rb["fidelity"] == 0.8)
+
313
+     def test_semantic_invalid_action_returns_200_with_error(
+         self, client: TestClient
+     ) -> None:
+         """A semantically invalid action (e.g. duration=999) returns 200
+         with info.error set, not a crash or 422."""
+         reset_data = _reset(client)
+         session_id = reset_data["session_id"]
+
+         bad_action = {
+             "action_type": "propose_protocol",
+             "sample_size": 5,
+             "controls": ["baseline"],
+             "technique": "some technique",
+             "duration_days": 999,
+             "required_equipment": [],
+             "required_reagents": [],
+             "questions": [],
+             "rationale": "Duration is impossibly long for the lab time limit.",
+         }
+         resp = client.post(
+             "/step", json={"session_id": session_id, "action": bad_action}
+         )
+
+         assert resp.status_code == 200
+         data = resp.json()
+         assert data["done"] is False
+         assert data["info"]["error"] is not None
+         assert "Validation errors" in data["info"]["error"]
+
+     def test_replay_uses_real_judge_data(self, client: TestClient) -> None:
+         """After a terminal step, GET /replay/{episode_id} returns
+         real judge_notes and verdict, not stub values."""
+         reset_data = _reset(client)
+         session_id = reset_data["session_id"]
+         episode_id = reset_data["episode_id"]
+
+         # Propose then accept
+         action = _good_action_payload(client)
+         client.post("/step", json={"session_id": session_id, "action": action})
+         client.post(
+             "/step",
+             json={"session_id": session_id, "action": _accept_action_payload()},
+         )
+
+         # Fetch replay
+         resp = client.get(f"/replay/{episode_id}")
+         assert resp.status_code == 200
+         replay = resp.json()
+
+         assert replay["agreement_reached"] is True
+         assert "rigor" in replay["judge_notes"]
+         assert replay["verdict"] == "accept"
+         assert replay["reward_breakdown"] is not None
+         assert replay["total_reward"] > 0.0
+         # Not the old stub string
+         assert "Stub audit" not in replay["judge_notes"]
+
+
371
+ # ---------------------------------------------------------------------------
+ # WebSocket handler — API 06
+ # ---------------------------------------------------------------------------
+
+
+ def _ws_send_recv(ws, msg: dict) -> dict:
+     """Send a JSON message over the WebSocket and return the parsed response."""
+     ws.send_text(json.dumps(msg))
+     return json.loads(ws.receive_text())
+
+
+ class TestWebSocket:
+     """API 06: WebSocket session handler with isolated env per connection."""
+
+     # -- basic connectivity --------------------------------------------------
+
+     def test_ws_ping_pong(self, client: TestClient) -> None:
+         with client.websocket_connect("/ws") as ws:
+             resp = _ws_send_recv(ws, {"type": "ping"})
+             assert resp["type"] == "pong"
+
+     def test_ws_reset_returns_observation(self, client: TestClient) -> None:
+         with client.websocket_connect("/ws") as ws:
+             resp = _ws_send_recv(ws, {
+                 "type": "reset", "seed": 42,
+                 "scenario": "math_reasoning", "difficulty": "easy",
+             })
+             assert resp["type"] == "reset_ok"
+             assert resp["episode_id"]
+             obs = resp["observation"]
+             assert obs["scientist"]["paper_title"]
+             assert obs["lab_manager"]["budget_total"] > 0
+
+     def test_ws_step_returns_result(self, client: TestClient) -> None:
+         action = _good_action_payload(client)
+         with client.websocket_connect("/ws") as ws:
+             _ws_send_recv(ws, {"type": "reset", "seed": 42})
+             resp = _ws_send_recv(ws, {"type": "step", "action": action})
+
+         assert resp["type"] == "step_ok"
+         assert resp["done"] is False
+         assert resp["reward"] == 0.0
+         assert resp["observation"] is not None
+
415
+     def test_ws_full_episode_real_reward(self, client: TestClient) -> None:
+         """Propose → accept returns real reward breakdown, not stub 0.8."""
+         action = _good_action_payload(client)
+         with client.websocket_connect("/ws") as ws:
+             _ws_send_recv(ws, {"type": "reset", "seed": 42})
+             _ws_send_recv(ws, {"type": "step", "action": action})
+             resp = _ws_send_recv(ws, {"type": "step", "action": _accept_action_payload()})
+
+         assert resp["type"] == "step_ok"
+         assert resp["done"] is True
+         assert resp["reward"] > 0.0
+
+         info = resp["info"]
+         assert info["agreement_reached"] is True
+         assert info["verdict"] == "accept"
+         rb = info["reward_breakdown"]
+         assert rb is not None
+         assert 0.0 <= rb["rigor"] <= 1.0
+         assert 0.0 <= rb["feasibility"] <= 1.0
+         assert 0.0 <= rb["fidelity"] <= 1.0
+         assert not (rb["rigor"] == 0.8 and rb["feasibility"] == 0.8)
+
+     # -- error handling ------------------------------------------------------
+
+     def test_ws_invalid_json(self, client: TestClient) -> None:
+         with client.websocket_connect("/ws") as ws:
+             ws.send_text("not valid json {{{")
+             resp = json.loads(ws.receive_text())
+             assert resp["type"] == "error"
+             assert "Invalid JSON" in resp["message"]
+
+     def test_ws_missing_action_field(self, client: TestClient) -> None:
+         with client.websocket_connect("/ws") as ws:
+             _ws_send_recv(ws, {"type": "reset", "seed": 42})
+             resp = _ws_send_recv(ws, {"type": "step"})
+             assert resp["type"] == "error"
+             assert "Missing" in resp["message"]
+
+     def test_ws_invalid_action_payload(self, client: TestClient) -> None:
+         """Structurally invalid action (missing required fields) → WS error."""
+         with client.websocket_connect("/ws") as ws:
+             _ws_send_recv(ws, {"type": "reset", "seed": 42})
+             resp = _ws_send_recv(ws, {
+                 "type": "step",
+                 "action": {"action_type": "propose_protocol"},
+             })
+             assert resp["type"] == "error"
+             assert "Invalid action" in resp["message"]
+
+     def test_ws_unknown_message_type(self, client: TestClient) -> None:
+         with client.websocket_connect("/ws") as ws:
+             resp = _ws_send_recv(ws, {"type": "banana"})
+             assert resp["type"] == "error"
+             assert "Unknown" in resp["message"]
+
+     # -- session isolation ---------------------------------------------------
+
+     def test_ws_session_isolation(self, client: TestClient) -> None:
+         """Two WebSocket connections have independent env state."""
+         action = _good_action_payload(client)
+
+         with client.websocket_connect("/ws") as ws1:
+             r1 = _ws_send_recv(ws1, {"type": "reset", "seed": 1})
+             _ws_send_recv(ws1, {"type": "step", "action": action})
+
+             with client.websocket_connect("/ws") as ws2:
+                 r2 = _ws_send_recv(ws2, {"type": "reset", "seed": 2})
+
+                 assert r1["episode_id"] != r2["episode_id"]
+                 # ws2 is at round 0, ws1 is at round 1
+                 step2 = _ws_send_recv(ws2, {"type": "step", "action": action})
+                 assert step2["observation"]["scientist"]["round_number"] == 1
+
488
+     # -- real-env integration (user-requested) --------------------------------
+
+     def test_ws_semantic_invalid_action_returns_step_ok_with_info_error(
+         self, client: TestClient
+     ) -> None:
+         """A structurally valid but semantically invalid action (e.g.
+         duration_days=999) returns step_ok with info.error — NOT a
+         transport-level WS error frame."""
+         with client.websocket_connect("/ws") as ws:
+             _ws_send_recv(ws, {"type": "reset", "seed": 42})
+             bad_action = {
+                 "action_type": "propose_protocol",
+                 "sample_size": 5,
+                 "controls": ["baseline"],
+                 "technique": "some technique",
+                 "duration_days": 999,
+                 "required_equipment": [],
+                 "required_reagents": [],
+                 "questions": [],
+                 "rationale": "Duration is impossibly long for the lab.",
+             }
+             resp = _ws_send_recv(ws, {"type": "step", "action": bad_action})
+
+         assert resp["type"] == "step_ok"
+         assert resp["done"] is False
+         assert resp["info"]["error"] is not None
+         assert "Validation errors" in resp["info"]["error"]
+
+     def test_ws_timeout_verdict(self, client: TestClient) -> None:
+         """Run to max_rounds without accept → done=True, verdict=timeout,
+         reward=0.0. Proves real-env integration."""
+         action = _good_action_payload(client)
+         with client.websocket_connect("/ws") as ws:
+             reset_resp = _ws_send_recv(ws, {"type": "reset", "seed": 42})
+             max_rounds = reset_resp["observation"]["scientist"]["max_rounds"]
+
+             resp = None
+             for _ in range(max_rounds):
+                 resp = _ws_send_recv(ws, {"type": "step", "action": action})
+
+         assert resp["done"] is True
+         assert resp["info"]["verdict"] == "timeout"
+         assert resp["reward"] == 0.0
+         assert resp["info"]["reward_breakdown"] is not None
+
533
+     def test_ws_terminal_episode_persists_real_replay_log(
+         self, client: TestClient
+     ) -> None:
+         """Complete a WS episode, then verify GET /replay/{episode_id}
+         returns real reward_breakdown, judge_notes, and verdict —
+         not stub strings."""
+         action = _good_action_payload(client)
+         with client.websocket_connect("/ws") as ws:
+             reset_resp = _ws_send_recv(ws, {"type": "reset", "seed": 42})
+             episode_id = reset_resp["episode_id"]
+
+             _ws_send_recv(ws, {"type": "step", "action": action})
+             _ws_send_recv(ws, {"type": "step", "action": _accept_action_payload()})
+
+         # Fetch replay via REST after WS connection is closed
+         replay_resp = client.get(f"/replay/{episode_id}")
+         assert replay_resp.status_code == 200
+         replay = replay_resp.json()
+
+         assert replay["agreement_reached"] is True
+         assert replay["verdict"] == "accept"
+         assert replay["total_reward"] > 0.0
+
+         # Real judge_notes, not stub
+         assert replay["judge_notes"] != ""
+         assert "Stub audit" not in replay["judge_notes"]
+         assert "rigor" in replay["judge_notes"]
+
+         # Real reward_breakdown with non-stub scores
+         rb = replay["reward_breakdown"]
+         assert rb is not None
+         assert 0.0 < rb["rigor"] <= 1.0
+         assert 0.0 < rb["feasibility"] <= 1.0
+         assert 0.0 < rb["fidelity"] <= 1.0
+         assert not (rb["rigor"] == 0.8 and rb["feasibility"] == 0.8)
+
569
+     # -- idle timeout & disconnect cleanup (API 07) -------------------------
+
+     def test_ws_idle_timeout_closes_connection(self, client: TestClient) -> None:
+         """API 07: server closes WebSocket after idle timeout (no messages)."""
+         with patch("server.app._WS_IDLE_TIMEOUT", 0.5):
+             with client.websocket_connect("/ws") as ws:
+                 # Don't send anything — let the server-side timeout fire
+                 time.sleep(1.0)
+                 with pytest.raises(WebSocketDisconnect) as exc_info:
+                     ws.receive_text()
+                 assert exc_info.value.code == 1000
+
+     def test_ws_env_closes_on_disconnect(self, client: TestClient) -> None:
+         """API 07: env.close() runs in the finally block on disconnect."""
+         import server.app as _app
+
+         _original_make_env = _app._make_env
+         close_called: list[bool] = []
+
+         def _tracked_make_env():
+             env = _original_make_env()
+             _original_close = env.close
+
+             def _tracking_close():
+                 close_called.append(True)
+                 _original_close()
+
+             env.close = _tracking_close
+             return env
+
+         with patch.object(_app, "_make_env", _tracked_make_env):
+             with client.websocket_connect("/ws") as ws:
+                 _ws_send_recv(ws, {"type": "ping"})
+         # Context manager exit sends disconnect; server runs finally block
+         # TestClient joins the ASGI thread, so close() has already run
+         assert len(close_called) == 1
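The close-tracking pattern used in the final test can be exercised in isolation. The sketch below is illustrative only: `FakeEnv` and `make_tracked_env` are hypothetical stand-ins, not names from this repo.

```python
class FakeEnv:
    """Hypothetical stand-in for the real environment object."""

    def close(self) -> None:
        """Pretend to release resources; the real env would clean up here."""


close_called: list[bool] = []


def make_tracked_env() -> FakeEnv:
    """Wrap close() so a test can assert it was invoked exactly once."""
    env = FakeEnv()
    original_close = env.close  # capture the bound method first

    def tracking_close() -> None:
        close_called.append(True)
        original_close()  # still run the real cleanup

    env.close = tracking_close  # instance-level override; the class is untouched
    return env


env = make_tracked_env()
env.close()
print(close_called)  # → [True]
```

Because the override lives on the instance, other envs created from the untouched class keep their original `close()`, which is one reason the test patches the factory rather than the class.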