ayushozha committed
Commit 8b157ab · 1 Parent(s): e50dca9

Recover env server client stack and deployment tracking
Dockerfile ADDED
@@ -0,0 +1,34 @@
+ # Root-level Dockerfile for Hugging Face Spaces deployment.
+ #
+ # HF Spaces with sdk:docker expects the Dockerfile at the repo root.
+ # This is identical to server/Dockerfile. Keep them in sync or remove
+ # server/Dockerfile once the team standardizes on this root copy.
+
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ # Install system deps
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     build-essential \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Install Python dependencies first for better layer caching
+ COPY server/requirements.txt ./server/requirements.txt
+ RUN pip install --no-cache-dir -r server/requirements.txt
+
+ # Copy package source
+ COPY replicalab/ ./replicalab/
+ COPY server/ ./server/
+ COPY pyproject.toml ./
+
+ # Install the replicalab package (non-editable, deps already present)
+ RUN pip install --no-cache-dir . --no-deps
+
+ # Run as a non-root user inside the container (HF Spaces requirement)
+ RUN useradd -m -u 1000 appuser && chown -R appuser /app
+ USER appuser
+
+ EXPOSE 7860
+
+ CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
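The image can be smoke-tested locally before wiring up the Space. A minimal sketch, assuming the container is already running via something like `docker build -t replicalab . && docker run -p 7860:7860 replicalab`, and assuming `server.app` exposes the `/health` route listed under API 01 in the task table (the tag name and route are illustrative, not confirmed project conventions):

```python
# Sketch, not project code: probe a locally running ReplicaLab container.
import urllib.error
import urllib.request


def health_status(base_url: str = "http://localhost:7860", timeout: float = 3.0) -> str:
    """Return 'ok' when /health answers 200, else a short diagnostic string."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return "ok" if resp.status == 200 else f"unexpected status {resp.status}"
    except (urllib.error.URLError, OSError) as exc:
        # Covers connection refused, DNS failure, and timeouts.
        return f"unreachable: {type(exc).__name__}"


if __name__ == "__main__":
    print(health_status())
```

Running this against a healthy container should print `ok`; against nothing it prints a short `unreachable: ...` diagnostic instead of raising.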
README.md CHANGED
@@ -1,3 +1,13 @@
+ ---
+ title: ReplicaLab
+ emoji: 🧪
+ colorFrom: blue
+ colorTo: green
+ sdk: docker
+ app_port: 7860
+ pinned: false
+ ---
+
  # ReplicaLab

  **A multi-agent constraint-aware planning environment built on [OpenEnv](https://github.com/openenv)**
@@ -8,11 +18,13 @@ ReplicaLab trains an agent to negotiate high-quality plans under real constraint

  ## Current Build Status

- - The repository is still in the foundation stage.
- - The Python package foundation is verified through editable install plus shared-model import checks.
+ - The repository is now past the foundation stage and has a working real environment plus deterministic judge pipeline.
+ - The Python package foundation is verified through editable install plus full test-suite checks.
  - Shared contracts currently live in `replicalab/models.py`, with the signed-off freeze in `docs/fnd08_frozen_json_contract.md`.
- - A stub-backed FastAPI and WebSocket server scaffold now exists in `server/app.py`, while real environment wiring is still in progress.
- - `openenv.yaml` now exists and passes local OpenEnv validation.
+ - `server/app.py` now serves the real `ReplicaLabEnv` by default, with the legacy stub retained only as a fallback safety path.
+ - `openenv.yaml` exists and passes local OpenEnv validation.
+ - Local Docker validation has been completed for the server image on port `7860`.
+ - Hugging Face Spaces Docker metadata is present in this README and the root `Dockerfile`; live hosted verification is still pending.
  - The frozen outer contract remains stable while the internal scenario engine moves toward a normalized scenario pack.
  - The planned Lab Manager path is hybrid: model-backed negotiation language plus deterministic feasibility grounding.
ReplicaLab_Comprehensive_Task_Division.md CHANGED
@@ -24,7 +24,7 @@ The goal is to let any team member pick up work immediately without confusion.

  **ReplicaLab** is an OpenEnv environment where a **Scientist agent** and a **Lab Manager agent** negotiate how to solve a constrained technical task under real world limits such as budget, tools, compute, schedule, stock, and staffing.

- The environment is used to **train the Scientist agent with reinforcement learning** so it learns to ask better questions, preserve objective quality, and produce more feasible plans under domain-specific constraints.

  The first domain focus is:

@@ -40,8 +40,8 @@ By judging time, the project should demonstrate:

  1. A working OpenEnv environment deployed on Hugging Face Spaces on port `7860`
  2. At least one full scenario family working end to end, with a target of three
- 3. A Scientist agent that can interact with the environment
- 4. A hybrid model-backed Lab Manager with deterministic feasibility grounding
  5. A deterministic judge and reward engine
  6. A Colab training notebook using Unsloth or HF TRL
  7. A reward curve showing improvement
@@ -58,31 +58,34 @@ By judging time, the project should demonstrate:

  1. OpenEnv environment implementation
  2. FastAPI and WebSocket serving
  3. Hugging Face Docker Space deployment
- 4. Scientist agent with structured JSON action output
- 5. Hybrid model-backed Lab Manager grounded by deterministic feasibility checks
  6. Judge rubric engine with deterministic scoring
  7. Three scenario families for MVP
     1. Mathematics reasoning and proof planning
     2. ML benchmark replication
     3. Finance or trading backtest planning
- 8. Reward logging
- 9. Replay logs
- 10. Colab RL notebook
- 11. Reward curve image
- 12. Thin React plus Vite frontend or OpenEnv `/web` fallback
- 13. README, demo video, submission package

  ## 3.2 Out of scope for the hackathon MVP

  1. Proving whether a real research paper is globally true or false
- 2. Parsing arbitrary real papers from the internet
  3. Real wet lab execution
  4. Live trading or production finance execution
  5. Real time collaboration features
  6. Training both Scientist and Lab Manager in self play
- 7. Complex third party enterprise integrations
- 8. Full multi-domain rollout unless time remains
- 9. Manager-led subagent orchestration unless the MVP is already stable

  ---
 
@@ -161,6 +164,56 @@ Rules for the normalized scenario layer:

  3. Difficulty and curriculum changes should mechanically alter constraints, resources, or conflicts rather than fork separate prompt logic.
  4. The deterministic scorer compares the final agreed plan against `hidden_reference_spec`; model-backed roles never own truth.

  ---
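Rule 4 above can be illustrated with a minimal sketch. This is not the project's scorer; the field names `required_steps` and `allowed_substitutions` are illustrative assumptions about the shape of `hidden_reference_spec`:

```python
# Sketch only: a deterministic, pure-function comparison of an agreed plan
# against a hidden reference spec (hypothetical field names).
def score_fidelity_sketch(plan_steps, hidden_reference_spec):
    required = hidden_reference_spec["required_steps"]
    substitutions = hidden_reference_spec.get("allowed_substitutions", {})
    plan = set(plan_steps)
    hits = 0
    for step in required:
        # A required step counts if present directly or via an allowed substitute.
        if step in plan or any(sub in plan for sub in substitutions.get(step, [])):
            hits += 1
    # Pure function of its inputs: same plan and spec always give the same score.
    return hits / len(required) if required else 1.0


spec = {"required_steps": ["baseline", "ablation"],
        "allowed_substitutions": {"ablation": ["sensitivity_check"]}}
print(score_fidelity_sketch(["baseline", "sensitivity_check"], spec))  # 1.0
```

Because no model output participates in the comparison, the score cannot drift between runs, which is what "model-backed roles never own truth" requires.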

  ## 5. Module and function ownership map
@@ -175,6 +228,9 @@ Rules for the normalized scenario layer:
  | `replicalab/agents/scientist_policy.py` | `build_scientist_prompt()`, `parse_scientist_output()` | Person B | trainable role |
  | `replicalab/agents/lab_manager_policy.py` | `generate_lab_manager_response()`, `check_feasibility()` | Person B with Person A | model-backed negotiation grounded by deterministic checker |
  | `replicalab/agents/judge_policy.py` | `explain_judgement()` optional only | Person A | explanation layer only |
  | `replicalab/scoring/rigor.py` | `score_rigor()` | Person A | deterministic |
  | `replicalab/scoring/feasibility.py` | `score_feasibility()` | Person A | deterministic |
  | `replicalab/scoring/fidelity.py` | `score_fidelity()` | Person A | deterministic |
@@ -297,12 +353,12 @@ Create a stable shared codebase, contracts, and development workflow so all work
  - Completed scope for `FND 11`: added `server/requirements.txt` with standalone runtime dependency pins and verified installation from that file
  - Completed scope for `FND 03`: imported the full React plus Vite frontend tree from Kush's branch onto `ayush`, including the app shell, pages, shared components, assets, and TypeScript config, and validated it with `npm --prefix frontend install` plus `npm --prefix frontend run build`
  - Completed scope for `FND 12`: imported `frontend/vite.config.ts` with local `/api` and `/ws` proxy support plus stable Vite build settings and validated the build on `ayush`
- - Partial backend scope imported from Max's PR: `server/app.py`, `server/Dockerfile`, and `docs/max/deployment.md` were normalized onto the current standards and validated locally against the stub env
  - Newly unblocked by `FND 08`: `MOD 01`, `MOD 02`, `MOD 03`, `MOD 12`, `SCN 01`
  - Newly unblocked by `FND 06`: `DOC 01`
  - Newly unblocked by `FND 03`: `FND 13`, `UI 01`
  - Remaining Epic E01 work still gated by follow-on dependencies: `FND 13`
- - Remaining completion items for the imported backend scaffold: real-env integration, Docker validation, and final deployment verification
  - Completed scope for `SCN 01` to `SCN 10`: added deterministic seed utilities, normalized scenario-pack models, math / ML / finance template builders, difficulty scaling, hidden reference specs, allowed substitutions, and seeded scenario tests
  - Completed scope for `SCN 11`: added three fixed golden scenarios for deterministic prompt and manual checks under `tests/fixtures/golden_scenarios.json`
  - Completed scope for `AGT 01`: added a domain-neutral Scientist system prompt builder that renders role instructions, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON output contract from normalized scenario data
@@ -453,14 +509,14 @@ As the Lab Manager, I want grounded negotiation plus deterministic feasibility c
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | AGT 01 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Draft domain-neutral system prompt for Scientist role from normalized scenario data | MOD 01, SCN 11 | 0.75h | prompt clearly explains role, mapped constraints, and JSON output contract | ✅ Completed | — |
  | AGT 02 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build observation to prompt formatting helper from normalized scenario-derived observations | AGT 01, MOD 03 | 0.75h | formatted prompt includes task info, history, and action schema consistently | ✅ Completed | — |
- | AGT 03 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 | 0.75h | malformed output triggers at least one controlled retry or explicit failure | Not started | — |
  | AGT 04 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build baseline heuristic Scientist for non trained smoke tests | AGT 02 | 1h | baseline can complete episodes without crashing | ✅ Completed | — |
  | AGT 05 | E04.2 | Person A and B | `replicalab/agents/lab_manager_policy.py` | Implement deterministic feasibility checker against normalized constraints, resources, schedule, and policy rules | SCN 07, MOD 05 | 1.25h | checker returns clear pass or fail per constraint dimension | ✅ Completed | Person B (Ayush) |
  | AGT 06 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | AGT 05, SCN 08 | 1h | lab manager can suggest at least one sensible revision when initial plan fails | ✅ Completed | — |
  | AGT 07 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Add model-backed response synthesis from feasibility results and suggested revisions | AGT 05 | 0.75h | output is readable, grounded in checker results, and maps cleanly to underlying checks | ✅ Completed | — |
- | AGT 08 | E04.1 | Person B | tests | Add prompt formatting and parse tests for Scientist policy | AGT 01 to AGT 04 | 0.75h | tests cover happy path and malformed output handling | Not started | — |
  | AGT 09 | E04.2 | Person A | tests | Add deterministic feasibility checker tests for Lab Manager grounding | AGT 05 to AGT 07 | 0.75h | same proposal plus same normalized scenario returns the same checker results every time | ⬜ Not started | — |
- | AGT 10 | E04.1 | Person B | `replicalab/prompts/` | Write prompt text files for all three roles: `scientist.txt`, `lab_manager.txt`, `judge.txt` | AGT 01, AGT 07, JDG 06 | 0.75h | prompt files exist, are loadable, and assemble correctly from normalized scenario data and agreed role behavior | ⬜ Not started | — |
  | AGT 11 | E04.1 | Person B | docs | Select and document base model for Scientist training with rationale for model size, license, and structured output capability | AGT 01 | 0.5h | decision is recorded and all team members know which model will be fine tuned | ✅ Completed | — |

  ---
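The AGT 05 acceptance criterion ("clear pass or fail per constraint dimension") can be sketched in a few lines. This is not the real `check_feasibility()`; the dimension names and plan/constraint keys are illustrative assumptions:

```python
# Sketch of the AGT 05 idea: deterministic per-dimension pass/fail results
# that the Lab Manager can ground its negotiation language in.
def check_feasibility_sketch(plan: dict, constraints: dict) -> dict:
    """Return {dimension: bool}; missing plan fields default to zero usage."""
    return {
        "budget": plan.get("cost", 0) <= constraints.get("budget", float("inf")),
        "schedule": plan.get("days", 0) <= constraints.get("max_days", float("inf")),
        "compute": plan.get("gpu_hours", 0) <= constraints.get("gpu_hours", float("inf")),
    }


result = check_feasibility_sketch(
    {"cost": 900, "days": 10, "gpu_hours": 64},
    {"budget": 1000, "max_days": 7, "gpu_hours": 100},
)
print(result)  # {'budget': True, 'schedule': False, 'compute': True}
```

Because the checker is a pure function of the proposal and the normalized constraints, rerunning it on the same inputs always returns the same verdicts, which is what AGT 09's determinism tests would pin down.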
@@ -478,20 +534,26 @@ As the training system, I need a stable reward so the model can improve.
  **US E05.2**
  As a judge, I need a readable score breakdown so I can understand why the environment rewarded or penalized the agent.

  ### Tasks

  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- | JDG 01 | E05.1 | Person A | `replicalab/scoring/rigor.py` | Implement rigor or objective-validity score for plan completeness, required checks, method quality, and justification | SCN 08 | 1.25h | score is between 0 and 1 and matches rubric examples | Not started | |
- | JDG 02 | E05.1 | Person A | `replicalab/scoring/feasibility.py` | Implement feasibility score for budget, resources, time, staffing, compute, and bookings | SCN 07, AGT 05 | 1.25h | score is between 0 and 1 and matches normalized constraint logic | Not started | |
- | JDG 03 | E05.1 | Person A | `replicalab/scoring/fidelity.py` | Implement fidelity score against hidden reference spec, required steps, and allowed substitutions | SCN 08 | 1h | score is between 0 and 1 and matches rubric examples | Not started | |
- | JDG 04 | E05.1 | Person A | `replicalab/scoring/rubric.py` | Implement total reward formula with bonuses and penalties | JDG 01 to JDG 03 | 0.75h | total reward formula matches agreed math and returns consistent output | Not started | |
- | JDG 05 | E05.2 | Person A | `replicalab/scoring/rubric.py` | Build reward breakdown object with component scores and penalties | JDG 04 | 0.5h | breakdown includes rigor, feasibility, fidelity, bonuses, and penalties | Not started | |
- | JDG 06 | E05.2 | Person A | `replicalab/agents/judge_policy.py` | Add optional plain English explanation function from reward breakdown | JDG 05 | 0.75h | explanation mirrors rubric and introduces no new hidden logic | ⬜ Not started | — |
- | JDG 07 | E05.1 | Person C | `replicalab/utils/logging.py` | Log reward breakdown to CSV or JSONL per episode | JDG 05, MOD 07 | 0.5h | reward file contains seed, scenario, score components, total reward, rounds, agreement | ⬜ Not started | — |
  | JDG 08 | E05.1 | Person A | tests | Add score determinism tests and edge case tests | JDG 01 to JDG 05 | 1h | perfect and broken protocols produce expected relative ordering | ⬜ Not started | — |
  | JDG 09 | E05.2 | Person D | UI mocks | Create mock score cards and language for frontend | JDG 05 | 0.5h | UI can display score breakdown from mock data | ⬜ Not started | — |
- | JDG 10 | E05.1 | Person B | notebook support | Expose component metrics for training plots | JDG 05, JDG 07 | 0.5h | notebook can read average rigor, feasibility, fidelity over time | ⬜ Not started | — |
  | JDG 11 | E05.2 | Person A | `replicalab/scoring/rubric.py` and `replicalab/agents/judge_policy.py` | Add structured final audit payload with `judge_notes`, `verdict`, and top failure reasons derived from the rubric | JDG 05, JDG 06 | 0.75h | final judgement output is deterministic, human readable, and consumable by env, API, logs, and UI | ⬜ Not started | — |

  ---
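The shape of the JDG 04/05 deliverables can be sketched as a small breakdown object. The agreed math lives in `replicalab/scoring/rubric.py`; the equal weighting and the clamp below are illustrative assumptions, not the team's formula:

```python
# Sketch only: a reward breakdown carrying the components JDG 05 requires,
# with a hypothetical total formula (equal weights, clamped to [0, 1]).
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardBreakdown:
    rigor: float        # each component expected in [0, 1]
    feasibility: float
    fidelity: float
    bonuses: float
    penalties: float

    def total(self) -> float:
        base = (self.rigor + self.feasibility + self.fidelity) / 3.0
        # Clamp so bonuses and penalties cannot push reward outside [0, 1].
        return max(0.0, min(1.0, base + self.bonuses - self.penalties))


b = RewardBreakdown(rigor=0.9, feasibility=0.6, fidelity=0.9, bonuses=0.05, penalties=0.1)
print(round(b.total(), 3))  # 0.75
```

A frozen dataclass keeps the breakdown immutable once computed, so the same object can safely feed the env, API responses, logs, and UI without any consumer mutating the score.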
@@ -516,14 +578,14 @@ As a judge, I want deterministic replay and cleanup.

  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- | ENV 01 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Create `ReplicaLabEnv` class skeleton | MOD 04, SCN 09 | 0.5h | environment class imports and instantiates without runtime errors | Not started | |
- | ENV 02 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Implement `reset(seed, template, difficulty)` | ENV 01, SCN 09 | 1h | reset returns initial observations and a fresh episode state | Not started | |
- | ENV 03 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Scientist turn application | ENV 02, AGT 05 | 1h | valid Scientist action updates state and history correctly | Not started | |
- | ENV 04 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Lab Manager response step | ENV 03, AGT 07 | 1h | lab manager response is appended and returned in the next observation | Not started | |
- | ENV 05 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement accept, timeout, and max round logic | ENV 03, ENV 04 | 0.75h | episode terminates correctly on agreement or round limit | Not started | |
- | ENV 06 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Integrate reward computation at finalization and optional intermediate score previews | ENV 05, JDG 05 | 1h | final step returns total reward and breakdown info | Not started | |
- | ENV 07 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `state()` | ENV 02 to ENV 06 | 0.5h | current environment state can be retrieved for debugging and replay | Not started | |
- | ENV 08 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `close()` cleanup | ENV 01 | 0.25h | close frees any transient resources and does not throw | Not started | |
  | ENV 09 | E06.3 | Person C | `replicalab/utils/logging.py` | Write episode logs on completion | ENV 06, JDG 07 | 0.5h | completed episodes generate replayable logs automatically | ⬜ Not started | — |
  | ENV 10 | E06.1 to E06.3 | Person A | tests | Add reset, step, invalid action, timeout, and deterministic replay tests | ENV 02 to ENV 09 | 1.25h | tests pass for seeded reset, valid step, invalid step, and replay consistency | ⬜ Not started | — |
  | ENV 11 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Attach judge audit payload to final `StepResult`, terminal observations, and replay state | ENV 06, JDG 11 | 0.5h | completed episodes expose audit notes alongside reward breakdown in a stable schema | ⬜ Not started | — |
@@ -548,23 +610,23 @@ As the team, we want one click reproducible deployment to HF Spaces.
  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | API 01 | E07.1 | Person C | `server/app.py` | Create FastAPI app shell and health endpoint | ENV 01 | 0.5h | `GET /health` returns 200 with simple payload | 🟡 Partial | — |
- | API 02 | E07.1 | Person C | `server/app.py` | Add `POST /reset` endpoint | ENV 02 | 0.75h | reset endpoint starts a new episode and returns initial observation | 🟡 Partial | |
- | API 03 | E07.1 | Person C | `server/app.py` | Add `POST /step` endpoint | ENV 06 | 0.75h | step endpoint accepts valid action and returns step result | 🟡 Partial | |
- | API 04 | E07.1 | Person C | `server/app.py` | Add `GET /scenarios` endpoint | SCN 03 to SCN 05 | 0.5h | endpoint lists available scenario families and difficulties | 🟡 Partial | |
  | API 05 | E07.1 | Person C | `server/app.py` | Add `GET /replay/{episode_id}` endpoint | ENV 09 | 0.75h | endpoint returns completed log for valid episode id | ⬜ Not started | — |
- | API 06 | E07.1 | Person C | `server/app.py` | Add WebSocket session handler | ENV 06 | 1.25h | each connection gets isolated environment state and can reset plus step | 🟡 Partial | |
- | API 07 | E07.1 | Person C | `server/app.py` | Add idle timeout and graceful disconnect cleanup | API 06, ENV 08 | 0.75h | stale connections close cleanly and environment closes without leak | 🟡 Partial | |
- | API 08 | E07.2 | Person C | `server/Dockerfile` | Build Dockerfile with Python app startup on port 7860 | API 01 to API 07 | 0.75h | local Docker run serves app on port 7860 | 🟡 Partial | |
- | API 09 | E07.2 | Person C | HF config files | Add Hugging Face Space metadata and deploy instructions | API 08 | 0.5h | Space config is valid for Docker app deployment | Not started | |
  | API 10 | E07.2 | Person C | deployment docs | Deploy live Space and verify health, reset, and step | API 09 | 1h | live Space responds successfully to health and one end to end episode | ⬜ Not started | — |
  | API 11 | E07.1 | Person C | tests | Add server endpoint tests and WebSocket smoke test | API 01 to API 07 | 1h | local server tests pass for health, reset, step, invalid payload, and ws connect | ⬜ Not started | — |
  | API 12 | E07.2 | Person D | docs | Capture deployment screenshots and public link for README | API 10 | 0.25h | README ready screenshots and live link are available | ⬜ Not started | — |
- | API 13 | E07.1 | Person C | `server/app.py` | Add CORS middleware configuration for frontend origins in dev and production | API 01 | 0.25h | frontend on localhost:5173 and HF Space origin can reach the API without CORS errors | 🟡 Partial | |
  | API 14 | E07.1 | Person C | `server/app.py` | Add REST session management so each user gets isolated environment state | API 02, API 03 | 0.75h | two concurrent REST users do not share or corrupt each other's episode state | 🟡 Partial | — |
- | API 15 | E07.2 | Person C | HF Space repo | Create HF Space README.md with YAML frontmatter specifying `sdk: docker`, `app_port: 7860`, title, and emoji | API 08 | 0.25h | HF Space config is valid and Space launches correctly from the metadata | Not started | |
  | API 16 | E07.2 | Person C | `server/Dockerfile` | Configure Docker to build frontend and serve static assets from FastAPI in a single container | API 08, UI 10 | 0.75h | single Docker container serves both API and frontend on port 7860 | ⬜ Not started | — |
- | API 17 | E07.2 | Person C | deployment docs | Document secrets and API key management for Scientist LLM access in deployment and notebook | API 09 | 0.5h | team knows how to set API keys in HF Space secrets, local env, and Colab secrets | ⬜ Not started | — |
- | API 18 | E07.1 | Person C | `server/app.py` | Include judge audit payload in REST, replay, and WebSocket responses for terminal episodes | API 03, API 05, API 06, ENV 11 | 0.5h | clients receive `judge_notes` and verdict fields without separate log file access | ⬜ Not started | — |
  | API 19 | E07.2 | Person C | `openenv.yaml` and deployment docs | Expose and verify OpenEnv built in `/web` fallback route locally and on HF Space | FND 09, API 08, API 10 | 0.5h | `/web` is documented, reachable, and able to run a seeded episode when the custom UI is unavailable | ⬜ Not started | — |

  ---
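The API 14 requirement (two concurrent REST users must not share episode state) comes down to keeping one environment instance per session id. A minimal sketch, not `server/app.py` itself; `FakeEnv` stands in for `ReplicaLabEnv` and the store shape is an assumption:

```python
# Sketch: per-session environment isolation for REST consumers.
import uuid


class FakeEnv:
    """Stand-in for ReplicaLabEnv with a tiny bit of per-episode state."""
    def __init__(self):
        self.round = 0

    def step(self):
        self.round += 1
        return self.round


class SessionStore:
    def __init__(self, env_factory):
        self._envs = {}
        self._factory = env_factory

    def create(self) -> str:
        sid = uuid.uuid4().hex
        self._envs[sid] = self._factory()   # fresh env per session
        return sid

    def get(self, sid: str):
        return self._envs[sid]

    def close(self, sid: str) -> None:
        self._envs.pop(sid, None)           # drop state on disconnect or idle timeout


store = SessionStore(FakeEnv)
a, b = store.create(), store.create()
store.get(a).step()
print(store.get(a).round, store.get(b).round)  # 1 0
```

Stepping session `a` leaves session `b` untouched, which is exactly the acceptance criterion; the same store can back the WebSocket handler's per-connection isolation (API 06).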
@@ -586,21 +648,21 @@ As the team, we want a repeatable evaluation workflow for before versus after co

  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- | TRN 01 | E08.1 | Person B | `notebooks/train_colab.ipynb` | Create notebook skeleton with setup, connect, train, and plot sections | API 10 | 0.5h | notebook has clear runnable sections in the right order | ⬜ Not started | — |
  | TRN 02 | E08.1 | Person B | notebook | Add package install and model setup cell for Unsloth or HF TRL | TRN 01 | 0.75h | notebook installs dependencies without manual edits beyond secrets | ⬜ Not started | — |
- | TRN 03 | E08.1 | Person B | notebook or `client.py` | Implement environment client wrapper for reset plus step over WebSocket or REST | API 06 | 1h | notebook can start and finish an episode against local or hosted env | ⬜ Not started | — |
- | TRN 04 | E08.1 | Person B | notebook | Implement rollout collection loop for Scientist episodes | TRN 03, AGT 01 | 1h | loop collects trajectories, rewards, and done signals | ⬜ Not started | — |
- | TRN 05 | E08.1 | Person B | notebook | Connect rollouts to GRPO or equivalent trainer | TRN 04 | 1.25h | at least one short training run completes without runtime errors | ⬜ Not started | — |
- | TRN 06 | E08.1 | Person B | notebook | Log episode reward, rigor, feasibility, fidelity, and rounds used | JDG 10, TRN 04 | 0.75h | notebook stores metrics frame across training episodes | ⬜ Not started | — |
  | TRN 07 | E08.2 | Person B | notebook | Plot reward curve and component curves with matplotlib | TRN 06 | 0.5h | plotted image shows visible metrics and can be saved to file | ⬜ Not started | — |
- | TRN 08 | E08.2 | Person B | notebook | Add before versus after evaluation on fixed seeds | SCN 11, TRN 05 | 1h | notebook compares baseline and trained policy on the same scenarios | ⬜ Not started | — |
  | TRN 09 | E08.2 | Person B | `replicalab/agents/scientist_policy.py` | Add policy loading path for trained adapter or checkpoint | TRN 05 | 0.5h | evaluation can switch between baseline and trained model cleanly | ⬜ Not started | — |
  | TRN 10 | E08.2 | Person B | docs | Export plot image and sample logs to `outputs/plots` | TRN 07 | 0.25h | plots are saved and versioned for README use | ⬜ Not started | — |
  | TRN 11 | E08.1 | Person C | infra notes | Document environment URL, secrets, and connection troubleshooting | TRN 03 | 0.25h | any team member can run the notebook using the notes | ⬜ Not started | — |
  | TRN 12 | E08.2 | Person D | storytelling | Convert evaluation results into two or three clear bullet insights for judges | TRN 08 | 0.5h | README and demo can state what improved in plain English | ⬜ Not started | — |
- | TRN 13 | E08.1 | Person B | `replicalab/client.py` | Create reusable environment client module with `connect()`, `reset()`, `step()`, `close()` over REST and WebSocket | API 06 | 1h | client module can be imported by notebook and other consumers without duplicating connection logic | Not started | |
  | TRN 14 | E08.1 | Person B | notebook or docs | Select and document base model for Scientist fine tuning with rationale for size, license, and structured output capability | TRN 01 | 0.5h | base model choice is documented and all team members know which model is being trained | ⬜ Not started | — |
- | TRN 15 | E08.2 | Person B | notebook | Add agreement rate and invalid action rate aggregation to evaluation outputs and before versus after comparison | TRN 06, TRN 08, OBS 09 | 0.5h | notebook reports reward, rounds, agreement rate, and invalid action rate for baseline and trained runs | ⬜ Not started | — |

  ---
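The TRN 04 rollout loop ("collects trajectories, rewards, and done signals") can be sketched against a stand-in environment. The real notebook would drive the served env through the TRN 13 client wrapper; `fake_env_episode` and its shapes are illustrative:

```python
# Sketch of a rollout collection loop over seeded episodes.
import random


def fake_env_episode(seed: int, max_rounds: int = 4):
    """Yield (observation, reward, done) triples like reset/step would."""
    rng = random.Random(seed)
    for r in range(1, max_rounds + 1):
        done = r == max_rounds
        # Reward arrives only at episode finalization, as in the judge design.
        yield {"round": r}, rng.random() if done else 0.0, done


def collect_rollouts(seeds):
    trajectories = []
    for seed in seeds:
        steps = [(obs, reward, done) for obs, reward, done in fake_env_episode(seed)]
        trajectories.append({
            "seed": seed,
            "steps": steps,
            "episode_reward": sum(r for _, r, _ in steps),
        })
    return trajectories


rollouts = collect_rollouts([0, 1, 2])
print(len(rollouts), rollouts[0]["steps"][-1][2])  # 3 True
```

Because every episode is driven by an explicit seed, the same seed list reproduces the same trajectory batch, which is what the before-versus-after evaluation on fixed seeds (TRN 08) relies on.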
 
@@ -661,7 +723,7 @@ As a judge, I want the same seeded scenario to be replayable.
  | OBS 03 | E10.1 | Person C | replay utilities | Add episode id generation and file naming conventions | OBS 01 | 0.25h | logs never overwrite and are easy to locate | ⬜ Not started | — |
  | OBS 04 | E10.2 | Person A | tests | Add deterministic replay test using seed and action sequence | ENV 10 | 0.75h | replay of same seed and actions matches prior state sequence | ⬜ Not started | — |
  | OBS 05 | E10.2 | Person D | UI | Surface episode id and replay link in UI | API 05, UI 08 | 0.5h | user can easily capture or revisit a past episode | ⬜ Not started | — |
- | OBS 06 | E10.1 | Person B | notebook | Log training run metadata including model, seed, scenario set, steps | TRN 06 | 0.5h | notebook exports metadata with each run for reproducibility | ⬜ Not started | — |
  | OBS 07 | E10.1 | Person C | scripts | Add simple local script to run one episode and dump logs | ENV 06, OBS 01 | 0.5h | one command produces a complete local sample log | ⬜ Not started | — |
  | OBS 08 | E10.2 | Person D | storytelling | Create static replay screenshots or gifs for README and video | UI 08 | 0.5h | at least two crisp visual assets are ready for docs and demo | ⬜ Not started | — |
  | OBS 09 | E10.1 | Person C | `replicalab/utils/logging.py` | Extend episode summary schema with `judge_notes`, `agreement`, `invalid_action_count`, and `invalid_action_rate` for replay and evaluation consumers | OBS 01, JDG 11, ENV 11 | 0.5h | every completed episode log contains the audit payload plus demo and evaluation metrics needed by notebook, UI, and README | ⬜ Not started | — |
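The OBS 04 property ("replay of same seed and actions matches prior state sequence") holds whenever all randomness flows from a single seeded generator. A minimal sketch with a hypothetical `toy_transition` standing in for the real environment step:

```python
# Sketch: deterministic replay from a seed plus an action sequence.
import random


def toy_transition(state: int, action: int, rng: random.Random) -> int:
    # Illustrative stand-in for the environment's step logic.
    return state + action + rng.randint(0, 9)


def replay(seed: int, actions: list) -> list:
    rng = random.Random(seed)        # all randomness flows from the seed
    state, trace = 0, []
    for action in actions:
        state = toy_transition(state, action, rng)
        trace.append(state)
    return trace


print(replay(42, [1, 2, 3]) == replay(42, [1, 2, 3]))  # True
```

A replay test then only needs to store the seed and the action sequence in the episode log (OBS 01/OBS 03) and assert the regenerated state trace matches the recorded one.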
@@ -685,15 +747,15 @@ As a judge, I want the system to work reliably when clicked live.

  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- | TST 01 | E11.1 | Person A | `tests/test_env.py` | Add reset returns valid observations test | ENV 02 | 0.5h | test confirms both roles receive valid structured observations | Not started | |
- | TST 02 | E11.1 | Person A | `tests/test_env.py` | Add valid action step test | ENV 03 to ENV 06 | 0.5h | valid action advances round and returns correct shape | Not started | |
- | TST 03 | E11.1 | Person A | `tests/test_env.py` | Add invalid action handling test | MOD 05, ENV 03 | 0.5h | invalid action yields structured error and environment survives | Not started | |
- | TST 04 | E11.1 | Person A | `tests/test_reward.py` | Add perfect protocol high reward test | JDG 04 | 0.5h | perfect protocol scores higher than baseline and broken protocol | Not started | |
- | TST 05 | E11.1 | Person A | `tests/test_reward.py` | Add zero dimension or penalty behavior test | JDG 04 | 0.5h | zero feasibility or timeout lowers reward as expected | Not started | |
  | TST 06 | E11.1 | Person C | `tests/test_server.py` | Add health plus reset plus step endpoint tests | API 01 to API 03 | 0.75h | API tests pass locally | ⬜ Not started | — |
- | TST 07 | E11.1 | Person C | `tests/test_server.py` | Add WebSocket connection and invalid payload tests | API 06 | 0.75h | WebSocket errors are graceful and session stays isolated | Not started | |
  | TST 08 | E11.2 | Person D | manual checklist | Create demo smoke checklist for local and hosted builds | UI 12, API 10 | 0.5h | team can verify full demo in under five minutes | ⬜ Not started | — |
- | TST 09 | E11.2 | Person B | notebook checklist | Create notebook smoke test for fresh runtime | TRN 12 | 0.5h | training notebook runs from top with minimal edits | ⬜ Not started | — |
  | TST 10 | E11.2 | all | full run | Execute one integrated test pass before freeze | all prior TST tasks | 1h | environment, UI, Space, and notebook all pass their smoke tests the same day | ⬜ Not started | — |
  | TST 11 | E11.1 | Person C | `tests/test_server.py` and `tests/test_env.py` | Add contract tests for judge audit payloads and invalid action metrics in terminal responses and replay logs | API 18, OBS 09 | 0.75h | tests confirm terminal payloads and replay files expose audit notes, agreement, and invalid action metrics | ⬜ Not started | — |
  | TST 12 | E11.2 | Person D | manual checklist | Add fallback `/web` smoke step plus replay slider and before versus after toggle checks to demo checklist | API 19, UI 14, UI 15 | 0.5h | checklist verifies custom UI path and fallback UI path are both demo ready | ⬜ Not started | — |
@@ -877,7 +939,7 @@ The environment client must expose:
  3. reward
  4. done
  5. final info including component scores
- 6. API key or secret configuration for LLM access in both hosted and notebook environments

  ### Scenario to judge contract
  Every scenario must provide:
 
  **ReplicaLab** is an OpenEnv environment where a **Scientist agent** and a **Lab Manager agent** negotiate how to solve a constrained technical task under real world limits such as budget, tools, compute, schedule, stock, and staffing.

+ The environment is used to **train the Scientist agent with reinforcement learning** so it learns to ask better questions, preserve objective quality, use bounded evidence tools correctly, and produce more feasible plans under domain-specific constraints.

  The first domain focus is:

  1. A working OpenEnv environment deployed on Hugging Face Spaces on port `7860`
  2. At least one full scenario family working end to end, with a target of three
+ 3. A Scientist agent that can interact with the environment through structured actions and bounded evidence tools
+ 4. A hybrid model-backed Lab Manager with deterministic feasibility grounding and bounded validation tools
  5. A deterministic judge and reward engine
  6. A Colab training notebook using Unsloth or HF TRL
  7. A reward curve showing improvement
 
  1. OpenEnv environment implementation
  2. FastAPI and WebSocket serving
  3. Hugging Face Docker Space deployment
+ 4. Scientist agent with structured JSON action output plus bounded search, code-check, and image-inspection capability
+ 5. Hybrid model-backed Lab Manager grounded by deterministic feasibility checks plus bounded validation tools
  6. Judge rubric engine with deterministic scoring
  7. Three scenario families for MVP
  1. Mathematics reasoning and proof planning
  2. ML benchmark replication
  3. Finance or trading backtest planning
+ 8. Frozen evidence packs for deterministic training plus limited live validation during demo or eval
+ 9. Reward logging
+ 10. Replay logs
+ 11. Colab RL notebook
+ 12. Reward curve image
+ 13. Thin React plus Vite frontend or OpenEnv `/web` fallback
+ 14. README, demo video, submission package
 
  ## 3.2 Out of scope for the hackathon MVP

  1. Proving whether a real research paper is globally true or false
+ 2. Unrestricted parsing of arbitrary live internet content inside the training loop
  3. Real wet lab execution
  4. Live trading or production finance execution
  5. Real time collaboration features
  6. Training both Scientist and Lab Manager in self play
+ 7. Open-ended autonomous coding outside a bounded verification or analysis sandbox
+ 8. Image generation or audio capabilities in the agent policy loop
+ 9. Complex third party enterprise integrations
+ 10. Full multi-domain rollout unless time remains
+ 11. Manager-led subagent orchestration unless the MVP is already stable
 
  ---

  3. Difficulty and curriculum changes should mechanically alter constraints, resources, or conflicts rather than fork separate prompt logic.
  4. The deterministic scorer compares the final agreed plan against `hidden_reference_spec`; model-backed roles never own truth.

+ For the bounded-tool MVP, pending scenario and environment work will extend the
+ existing normalized scenario pack with additive evidence fields. This is an
+ extension below the frozen outer contract, not a reopening of `FND 08`,
+ `MOD 01`, `MOD 02`, or `MOD 03`.
+
+ Tool-capable scenario extensions:
+
+ 1. `evidence_pack`
+ 2. `artifact_refs`
+ 3. `allowed_tools`
+ 4. `tool_budget`
+ 5. `validation_policy`
+
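The five additive extension fields above can be sketched as a small container. The field names come from the list; the types, defaults, and the `EvidenceExtension` name itself are illustrative assumptions, not part of the frozen contract:

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceExtension:
    """Hypothetical sketch of the additive scenario fields; types are assumed."""
    evidence_pack: dict = field(default_factory=dict)   # frozen facts keyed by evidence id
    artifact_refs: list = field(default_factory=list)   # e.g. paths to tables, figures, screenshots
    allowed_tools: list = field(default_factory=list)   # subset of the bounded tool names
    tool_budget: dict = field(default_factory=dict)     # e.g. {"search_evidence": 3}
    validation_policy: str = "frozen_only"              # assumed flag: frozen packs during training

# Example: a scenario that only permits bounded evidence search
ext = EvidenceExtension(allowed_tools=["search_evidence"],
                        tool_budget={"search_evidence": 3})
```

Because every field has a default, the extension stays additive: existing scenario packs that omit these fields still parse unchanged.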
+ ## 4.3 Bounded tool capability policy
+
+ The richer-capability MVP keeps the final outward action contract stable while
+ adding bounded tools below it.
+
+ ### Scientist allowed capabilities
+
+ 1. `search_evidence`
+ - retrieve supporting facts, benchmark rules, paper details, or official references
+ - not a reward source
+ 2. `run_code_check`
+ - bounded code or config analysis, metric checks, value generation, runtime or cost estimation
+ 3. `inspect_image`
+ - read tables, plots, figures, screenshots, and charts for evidence extraction
+
+ ### Lab Manager allowed capabilities
+
+ 1. `search_resources`
+ - retrieve resource, policy, benchmark, or documentation constraints
+ 2. `run_code_check`
+ - validate cost, runtime, config, reproducibility, or execution assumptions
+ 3. `inspect_image`
+ - inspect figures, charts, and screenshots relevant to feasibility or policy review
+
+ ### Judge capability rules
+
+ 1. The judge reward remains deterministic and must not depend on live web search.
+ 2. Tool traces and evidence references may inform deterministic penalties, bonuses, or audit text.
+ 3. The judge may use bounded evidence verification for demo or audit text, but never as the training reward source.
+
+ ### Training and demo rules
+
+ 1. Training uses frozen evidence packs and deterministic tool traces whenever possible.
+ 2. Live web search is limited to demo-time or eval-time validation, not the core training reward loop.
+ 3. Image generation and audio are excluded from the policy loop for the hackathon MVP.
+ 4. Coding capability must stay sandboxed and task-scoped rather than open-ended.
+
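The allow-list plus budget rules above amount to a small gate in front of every tool call. A minimal sketch, assuming a per-episode budget dict shaped like `{"search_evidence": 3}`; the class and exception names are illustrative, not from the repo:

```python
class ToolBudgetExceeded(Exception):
    """Raised when a tool is not allowed or its per-episode budget is spent."""

class BoundedToolGate:
    def __init__(self, allowed_tools, tool_budget):
        self.allowed = set(allowed_tools)
        self.remaining = dict(tool_budget)  # copy so the scenario pack stays untouched

    def call(self, tool_name, tool_fn, *args, **kwargs):
        # Reject tools outside the scenario's allow-list outright.
        if tool_name not in self.allowed:
            raise ToolBudgetExceeded(f"{tool_name} is not allowed in this scenario")
        # Enforce the bounded budget before the underlying tool runs.
        if self.remaining.get(tool_name, 0) <= 0:
            raise ToolBudgetExceeded(f"{tool_name} budget exhausted")
        self.remaining[tool_name] -= 1
        return tool_fn(*args, **kwargs)

# Example: one budgeted search against a frozen evidence pack
gate = BoundedToolGate(["search_evidence"], {"search_evidence": 1})
result = gate.call("search_evidence", lambda q: f"frozen result for {q}", "benchmark rules")
```

During training the `tool_fn` would read from the frozen evidence pack, so the same seed plus the same actions reproduces the same traces.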
  ---
  ## 5. Module and function ownership map
 
228
  | `replicalab/agents/scientist_policy.py` | `build_scientist_prompt()`, `parse_scientist_output()` | Person B | trainable role |
229
  | `replicalab/agents/lab_manager_policy.py` | `generate_lab_manager_response()`, `check_feasibility()` | Person B with Person A | model-backed negotiation grounded by deterministic checker |
230
  | `replicalab/agents/judge_policy.py` | `explain_judgement()` optional only | Person A | explanation layer only |
231
+ | `replicalab/tools/search.py` | `search_evidence()`, `search_resources()` | Person B with Person C | bounded retrieval and validation only |
232
+ | `replicalab/tools/code_tools.py` | `run_code_check()` | Person B | bounded code analysis, config checks, and derived-value generation |
233
+ | `replicalab/tools/image_tools.py` | `inspect_image()` | Person B with Person D | bounded table, chart, figure, and screenshot inspection |
234
  | `replicalab/scoring/rigor.py` | `score_rigor()` | Person A | deterministic |
235
  | `replicalab/scoring/feasibility.py` | `score_feasibility()` | Person A | deterministic |
236
  | `replicalab/scoring/fidelity.py` | `score_fidelity()` | Person A | deterministic |
 
353
  - Completed scope for `FND 11`: added `server/requirements.txt` with standalone runtime dependency pins and verified installation from that file
354
  - Completed scope for `FND 03`: imported the full React plus Vite frontend tree from Kush's branch onto `ayush`, including the app shell, pages, shared components, assets, and TypeScript config, and validated it with `npm --prefix frontend install` plus `npm --prefix frontend run build`
355
  - Completed scope for `FND 12`: imported `frontend/vite.config.ts` with local `/api` and `/ws` proxy support plus stable Vite build settings and validated the build on `ayush`
356
+ - Backend and deployment scope imported from Max's PR has now been normalized onto the current standards, validated against the real env, Docker-verified locally, and extended with HF Spaces metadata plus deployment instructions
357
  - Newly unblocked by `FND 08`: `MOD 01`, `MOD 02`, `MOD 03`, `MOD 12`, `SCN 01`
358
  - Newly unblocked by `FND 06`: `DOC 01`
359
  - Newly unblocked by `FND 03`: `FND 13`, `UI 01`
360
  - Remaining Epic E01 work still gated by follow-on dependencies: `FND 13`
361
+ - Remaining completion items for the backend and deployment path: live HF Space bring-up (`API 10`), secrets documentation (`API 17`), replay persistence, and the remaining partial API polish tasks
362
  - Completed scope for `SCN 01` to `SCN 10`: added deterministic seed utilities, normalized scenario-pack models, math / ML / finance template builders, difficulty scaling, hidden reference specs, allowed substitutions, and seeded scenario tests
363
  - Completed scope for `SCN 11`: added three fixed golden scenarios for deterministic prompt and manual checks under `tests/fixtures/golden_scenarios.json`
364
  - Completed scope for `AGT 01`: added a domain-neutral Scientist system prompt builder that renders role instructions, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON output contract from normalized scenario data
 
509
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
510
  | AGT 01 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Draft domain-neutral system prompt for Scientist role from normalized scenario data | MOD 01, SCN 11 | 0.75h | prompt clearly explains role, mapped constraints, and JSON output contract | ✅ Completed | — |
511
  | AGT 02 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build observation to prompt formatting helper from normalized scenario-derived observations | AGT 01, MOD 03 | 0.75h | formatted prompt includes task info, history, and action schema consistently | ✅ Completed | — |
512
+ | AGT 03 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 | 0.75h | malformed output triggers at least one controlled retry or explicit failure | Completed | — |
513
  | AGT 04 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build baseline heuristic Scientist for non trained smoke tests | AGT 02 | 1h | baseline can complete episodes without crashing | ✅ Completed | — |
514
  | AGT 05 | E04.2 | Person A and B | `replicalab/agents/lab_manager_policy.py` | Implement deterministic feasibility checker against normalized constraints, resources, schedule, and policy rules | SCN 07, MOD 05 | 1.25h | checker returns clear pass or fail per constraint dimension | ✅ Completed | Person B (Ayush) |
515
  | AGT 06 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | AGT 05, SCN 08 | 1h | lab manager can suggest at least one sensible revision when initial plan fails | ✅ Completed | — |
516
  | AGT 07 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Add model-backed response synthesis from feasibility results and suggested revisions | AGT 05 | 0.75h | output is readable, grounded in checker results, and maps cleanly to underlying checks | ✅ Completed | — |
517
+ | AGT 08 | E04.1 | Person B | tests | Add prompt formatting, parse, and bounded-tool policy tests for Scientist policy | AGT 01 to AGT 04 | 0.75h | tests cover happy path, malformed output handling, and stable tool-policy reminders | ✅ Completed | — |
518
  | AGT 09 | E04.2 | Person A | tests | Add deterministic feasibility checker tests for Lab Manager grounding | AGT 05 to AGT 07 | 0.75h | same proposal plus same normalized scenario returns the same checker results every time | ⬜ Not started | — |
519
+ | AGT 10 | E04.1 | Person B | `replicalab/prompts/` | Write prompt text files for all three roles: `scientist.txt`, `lab_manager.txt`, `judge.txt`, including bounded rules for search, code checks, and image inspection | AGT 01, AGT 07, JDG 06 | 0.75h | prompt files exist, are loadable, encode bounded tool rules clearly, and assemble correctly from normalized scenario data and agreed role behavior | ⬜ Not started | — |
520
  | AGT 11 | E04.1 | Person B | docs | Select and document base model for Scientist training with rationale for model size, license, and structured output capability | AGT 01 | 0.5h | decision is recorded and all team members know which model will be fine tuned | ✅ Completed | — |
521
 
522
  ---
 
534
  **US E05.2**
535
  As a judge, I need a readable score breakdown so I can understand why the environment rewarded or penalized the agent.
536
 
537
+ ### Executor notes
538
+
539
+ - `JDG 01` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
540
+ - `JDG 02` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
541
+ - `JDG 03` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
542
+
543
  ### Tasks
544
 
545
  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
546
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
547
+ | JDG 01 | E05.1 | Person A | `replicalab/scoring/rigor.py` | Implement rigor or objective-validity score for plan completeness, required checks, method quality, justification, and correct bounded evidence use when present | SCN 08 | 1.25h | score is between 0 and 1, matches rubric examples, and rewards correct evidence-backed planning without depending on live web results | Completed | Person B (Ayush) |
548
+ | JDG 02 | E05.1 | Person A | `replicalab/scoring/feasibility.py` | Implement feasibility score for budget, resources, time, staffing, compute, bookings, and deterministic tool-backed validation results | SCN 07, AGT 05 | 1.25h | score is between 0 and 1 and matches normalized constraint logic plus deterministic tool outcomes | Completed | Person B (Ayush) |
549
+ | JDG 03 | E05.1 | Person A | `replicalab/scoring/fidelity.py` | Implement fidelity score against hidden reference spec, required steps, allowed substitutions, and supported evidence claims when present | SCN 08 | 1h | score is between 0 and 1 and matches rubric examples for plan and evidence alignment | Completed | Person B (Ayush) |
550
+ | JDG 04 | E05.1 | Person A | `replicalab/scoring/rubric.py` | Implement total reward formula with bonuses and penalties, including deterministic penalties for invalid tool use or unsupported evidence claims | JDG 01 to JDG 03 | 0.75h | total reward formula matches agreed math and returns consistent output for plan quality and bounded tool behavior | Completed | Person B (Ayush) |
551
+ | JDG 05 | E05.2 | Person A | `replicalab/scoring/rubric.py` | Build reward breakdown object with component scores, penalties, and tool-use diagnostics | JDG 04 | 0.5h | breakdown includes rigor, feasibility, fidelity, bonuses, penalties, and bounded tool diagnostics | Completed | Person B (Ayush) |
552
+ | JDG 06 | E05.2 | Person A | `replicalab/agents/judge_policy.py` | Add optional plain English explanation function from reward breakdown | JDG 05 | 0.75h | explanation mirrors rubric, may reference bounded evidence or tool outcomes, and introduces no new hidden logic | ⬜ Not started | — |
553
+ | JDG 07 | E05.1 | Person C | `replicalab/utils/logging.py` | Log reward breakdown to CSV or JSONL per episode | JDG 05, MOD 07 | 0.5h | reward file contains seed, scenario, score components, total reward, rounds, agreement, and bounded tool metrics | ⬜ Not started | — |
554
  | JDG 08 | E05.1 | Person A | tests | Add score determinism tests and edge case tests | JDG 01 to JDG 05 | 1h | perfect and broken protocols produce expected relative ordering | ⬜ Not started | — |
555
  | JDG 09 | E05.2 | Person D | UI mocks | Create mock score cards and language for frontend | JDG 05 | 0.5h | UI can display score breakdown from mock data | ⬜ Not started | — |
556
+ | JDG 10 | E05.1 | Person B | notebook support | Expose component metrics for training plots | JDG 05, JDG 07 | 0.5h | notebook can read average rigor, feasibility, fidelity, and bounded tool metrics over time | ⬜ Not started | — |
557
  | JDG 11 | E05.2 | Person A | `replicalab/scoring/rubric.py` and `replicalab/agents/judge_policy.py` | Add structured final audit payload with `judge_notes`, `verdict`, and top failure reasons derived from the rubric | JDG 05, JDG 06 | 0.75h | final judgement output is deterministic, human readable, and consumable by env, API, logs, and UI | ⬜ Not started | — |
558
 
559
  ---
 
578
 
579
  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | ENV 01 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Create `ReplicaLabEnv` class skeleton | MOD 04, SCN 09 | 0.5h | environment class imports and instantiates without runtime errors | ✅ Completed | Person B (Ayush) |
+ | ENV 02 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Implement `reset(seed, template, difficulty)` | ENV 01, SCN 09 | 1h | reset returns initial observations and a fresh episode state | ✅ Completed | Person B (Ayush) |
+ | ENV 03 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Scientist turn application and bounded tool mediation | ENV 02, AGT 05 | 1h | valid Scientist action plus any allowed tool traces update state and history correctly | ✅ Completed | Person B (Ayush) |
+ | ENV 04 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Lab Manager response step with bounded validation tools | ENV 03, AGT 07 | 1h | lab manager response plus any supporting bounded tool traces are appended and returned in the next observation | ✅ Completed | Person B (Ayush) |
+ | ENV 05 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement accept, timeout, and max round logic | ENV 03, ENV 04 | 0.75h | episode terminates correctly on agreement or round limit | ✅ Completed | Person B (Ayush) |
+ | ENV 06 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Integrate reward computation at finalization and optional intermediate score previews | ENV 05, JDG 05 | 1h | final step returns total reward, breakdown info, and deterministic penalties or bonuses for bounded tool behavior | ✅ Completed | Person B (Ayush) |
+ | ENV 07 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `state()` | ENV 02 to ENV 06 | 0.5h | current environment state can be retrieved for debugging and replay | ✅ Completed | Person B (Ayush) |
+ | ENV 08 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `close()` cleanup | ENV 01 | 0.25h | close frees any transient resources and does not throw | ✅ Completed | Person B (Ayush) |
  | ENV 09 | E06.3 | Person C | `replicalab/utils/logging.py` | Write episode logs on completion | ENV 06, JDG 07 | 0.5h | completed episodes generate replayable logs automatically | ⬜ Not started | — |
  | ENV 10 | E06.1 to E06.3 | Person A | tests | Add reset, step, invalid action, timeout, and deterministic replay tests | ENV 02 to ENV 09 | 1.25h | tests pass for seeded reset, valid step, invalid step, and replay consistency | ⬜ Not started | — |
  | ENV 11 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Attach judge audit payload to final `StepResult`, terminal observations, and replay state | ENV 06, JDG 11 | 0.5h | completed episodes expose audit notes alongside reward breakdown in a stable schema | ⬜ Not started | — |
 
  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | API 01 | E07.1 | Person C | `server/app.py` | Create FastAPI app shell and health endpoint | ENV 01 | 0.5h | `GET /health` returns 200 with simple payload | 🟡 Partial | — |
+ | API 02 | E07.1 | Person C | `server/app.py` | Add `POST /reset` endpoint | ENV 02 | 0.75h | reset endpoint starts a new episode and returns initial observation | ✅ Completed | Person B (Ayush) |
+ | API 03 | E07.1 | Person C | `server/app.py` | Add `POST /step` endpoint | ENV 06 | 0.75h | step endpoint accepts valid action and returns step result | ✅ Completed | Person B (Ayush) |
+ | API 04 | E07.1 | Person C | `server/app.py` | Add `GET /scenarios` endpoint | SCN 03 to SCN 05 | 0.5h | endpoint lists available scenario families and difficulties | ✅ Completed | Person B (Ayush) |
  | API 05 | E07.1 | Person C | `server/app.py` | Add `GET /replay/{episode_id}` endpoint | ENV 09 | 0.75h | endpoint returns completed log for valid episode id | ⬜ Not started | — |
+ | API 06 | E07.1 | Person C | `server/app.py` | Add WebSocket session handler | ENV 06 | 1.25h | each connection gets isolated environment state and can reset plus step | ✅ Completed | Person B (Ayush) |
+ | API 07 | E07.1 | Person C | `server/app.py` | Add idle timeout and graceful disconnect cleanup | API 06, ENV 08 | 0.75h | stale connections close cleanly and environment closes without leak | ✅ Completed | Person B (Ayush) |
+ | API 08 | E07.2 | Person C | `server/Dockerfile` | Build Dockerfile with Python app startup on port 7860 | API 01 to API 07 | 0.75h | local Docker run serves app on port 7860 | ✅ Completed | Person B (Ayush) |
+ | API 09 | E07.2 | Person C | HF config files | Add Hugging Face Space metadata and deploy instructions | API 08 | 0.5h | Space config is valid for Docker app deployment | ✅ Completed | Person B (Ayush) |
  | API 10 | E07.2 | Person C | deployment docs | Deploy live Space and verify health, reset, and step | API 09 | 1h | live Space responds successfully to health and one end to end episode | ⬜ Not started | — |
  | API 11 | E07.1 | Person C | tests | Add server endpoint tests and WebSocket smoke test | API 01 to API 07 | 1h | local server tests pass for health, reset, step, invalid payload, and ws connect | ⬜ Not started | — |
  | API 12 | E07.2 | Person D | docs | Capture deployment screenshots and public link for README | API 10 | 0.25h | README ready screenshots and live link are available | ⬜ Not started | — |
+ | API 13 | E07.1 | Person C | `server/app.py` | Add CORS middleware configuration for frontend origins in dev and production | API 01 | 0.25h | frontend on localhost:5173 and HF Space origin can reach the API without CORS errors | ✅ Completed | Person B (Ayush) |
  | API 14 | E07.1 | Person C | `server/app.py` | Add REST session management so each user gets isolated environment state | API 02, API 03 | 0.75h | two concurrent REST users do not share or corrupt each other's episode state | 🟡 Partial | — |
+ | API 15 | E07.2 | Person C | HF Space repo | Create HF Space README.md with YAML frontmatter specifying `sdk: docker`, `app_port: 7860`, title, and emoji | API 08 | 0.25h | HF Space config is valid and Space launches correctly from the metadata | ✅ Completed | Person B (Ayush) |
  | API 16 | E07.2 | Person C | `server/Dockerfile` | Configure Docker to build frontend and serve static assets from FastAPI in a single container | API 08, UI 10 | 0.75h | single Docker container serves both API and frontend on port 7860 | ⬜ Not started | — |
+ | API 17 | E07.2 | Person C | deployment docs | Document secrets and API key management for hosted Scientist model access in deployment and notebook | API 09 | 0.5h | team knows how to set API keys in HF Space secrets, local env, and Colab secrets | ⬜ Not started | — |
+ | API 18 | E07.1 | Person C | `server/app.py` | Include judge audit payload plus bounded tool-trace summaries in REST, replay, and WebSocket responses for terminal episodes | API 03, API 05, API 06, ENV 11 | 0.5h | clients receive `judge_notes`, verdict fields, and bounded tool audit data without separate log file access | ⬜ Not started | — |
  | API 19 | E07.2 | Person C | `openenv.yaml` and deployment docs | Expose and verify OpenEnv built in `/web` fallback route locally and on HF Space | FND 09, API 08, API 10 | 0.5h | `/web` is documented, reachable, and able to run a seeded episode when the custom UI is unavailable | ⬜ Not started | — |

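API 14's requirement, that two concurrent REST users never share episode state, comes down to keying each environment by a session id minted at reset. A handler-level sketch, with the FastAPI route wiring omitted; `SESSIONS`, `FakeEnv`, and the function names are illustrative stand-ins, not the repo's actual API:

```python
import uuid

SESSIONS = {}  # session_id -> env instance, in-memory and per-process

class FakeEnv:
    """Stand-in for ReplicaLabEnv; the real env exposes reset()/step() similarly."""
    def __init__(self):
        self.round = 0
    def reset(self):
        self.round = 0
        return {"round": self.round}
    def step(self, action):
        self.round += 1
        return {"round": self.round, "action": action, "done": False}

def reset_handler():
    # Each reset mints a fresh session id, so concurrent users never share state.
    session_id = str(uuid.uuid4())
    env = FakeEnv()
    SESSIONS[session_id] = env
    return {"session_id": session_id, "observation": env.reset()}

def step_handler(payload):
    env = SESSIONS.get(payload.get("session_id"))
    if env is None:
        raise KeyError("unknown session")  # a real handler would return HTTP 404
    return env.step(payload.get("action"))
```

Wired behind `POST /reset` and `POST /step`, each client only ever touches the env stored under its own id; idle-timeout cleanup (API 07) would evict stale entries from the same dict.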
  ---
 

  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | TRN 01 | E08.1 | Person B | `notebooks/train_colab.ipynb` | Create notebook skeleton with setup, connect, train, bounded-tool policy, and plot sections | API 10 | 0.5h | notebook has clear runnable sections in the right order and documents the bounded-tool policy | ⬜ Not started | — |
  | TRN 02 | E08.1 | Person B | notebook | Add package install and model setup cell for Unsloth or HF TRL | TRN 01 | 0.75h | notebook installs dependencies without manual edits beyond secrets | ⬜ Not started | — |
+ | TRN 03 | E08.1 | Person B | notebook or `client.py` | Implement environment client wrapper for reset plus step over WebSocket or REST | API 06 | 1h | notebook can start and finish an episode against local or hosted env and can read tool-aware step payloads | ⬜ Not started | — |
+ | TRN 04 | E08.1 | Person B | notebook | Implement rollout collection loop for Scientist episodes | TRN 03, AGT 01 | 1h | loop collects trajectories, rewards, done signals, and bounded tool traces from frozen evidence packs | ⬜ Not started | — |
+ | TRN 05 | E08.1 | Person B | notebook | Connect rollouts to GRPO or equivalent trainer | TRN 04 | 1.25h | at least one short training run completes without runtime errors while preserving deterministic reward and frozen evidence inputs | ⬜ Not started | — |
+ | TRN 06 | E08.1 | Person B | notebook | Log episode reward, rigor, feasibility, fidelity, rounds used, and bounded tool metrics | JDG 10, TRN 04 | 0.75h | notebook stores a metrics frame across training episodes including bounded tool metrics | ⬜ Not started | — |
  | TRN 07 | E08.2 | Person B | notebook | Plot reward curve and component curves with matplotlib | TRN 06 | 0.5h | plotted image shows visible metrics and can be saved to file | ⬜ Not started | — |
+ | TRN 08 | E08.2 | Person B | notebook | Add before versus after evaluation on fixed seeds and frozen evidence packs | SCN 11, TRN 05 | 1h | notebook compares baseline and trained policy on the same scenarios and evidence packs | ⬜ Not started | — |
  | TRN 09 | E08.2 | Person B | `replicalab/agents/scientist_policy.py` | Add policy loading path for trained adapter or checkpoint | TRN 05 | 0.5h | evaluation can switch between baseline and trained model cleanly | ⬜ Not started | — |
  | TRN 10 | E08.2 | Person B | docs | Export plot image and sample logs to `outputs/plots` | TRN 07 | 0.25h | plots are saved and versioned for README use | ⬜ Not started | — |
  | TRN 11 | E08.1 | Person C | infra notes | Document environment URL, secrets, and connection troubleshooting | TRN 03 | 0.25h | any team member can run the notebook using the notes | ⬜ Not started | — |
  | TRN 12 | E08.2 | Person D | storytelling | Convert evaluation results into two or three clear bullet insights for judges | TRN 08 | 0.5h | README and demo can state what improved in plain English | ⬜ Not started | — |
+ | TRN 13 | E08.1 | Person B | `replicalab/client.py` | Create reusable environment client module with `connect()`, `reset()`, `step()`, `close()` over REST and WebSocket | API 06 | 1h | client module can be imported by notebook and other consumers without duplicating connection logic | ✅ Completed | 2026-03-08 |
  | TRN 14 | E08.1 | Person B | notebook or docs | Select and document base model for Scientist fine tuning with rationale for size, license, and structured output capability | TRN 01 | 0.5h | base model choice is documented and all team members know which model is being trained | ⬜ Not started | — |
+ | TRN 15 | E08.2 | Person B | notebook | Add agreement rate, invalid action rate, and invalid bounded-tool rate aggregation to evaluation outputs and before versus after comparison | TRN 06, TRN 08, OBS 09 | 0.5h | notebook reports reward, rounds, agreement rate, invalid action rate, and invalid bounded-tool rate for baseline and trained runs | ⬜ Not started | — |
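The TRN 13 client named above exposes `connect()`, `reset()`, `step()`, and `close()`. A REST-only sketch using just the standard library; the endpoint payload shapes and the `session_id` field are assumptions about the server contract, not taken from `replicalab/client.py` itself:

```python
import json
import urllib.request

class ReplicaLabClient:
    """Hypothetical REST client sketch; the real module also covers WebSocket."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")
        self.session_id = None

    def _post(self, path, payload):
        # Encode the payload as JSON and return the decoded JSON response.
        req = urllib.request.Request(
            self.base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

    def reset(self, seed=None, template=None, difficulty=None):
        out = self._post("/reset", {"seed": seed, "template": template,
                                    "difficulty": difficulty})
        self.session_id = out.get("session_id")  # remembered for later steps
        return out

    def step(self, action):
        return self._post("/step", {"session_id": self.session_id, "action": action})

    def close(self):
        self.session_id = None
```

The notebook rollout loop would then call `reset()` once per episode and `step()` until the returned payload reports `done`, reading reward and component scores from the final info.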
 
  ---

  | OBS 03 | E10.1 | Person C | replay utilities | Add episode id generation and file naming conventions | OBS 01 | 0.25h | logs never overwrite and are easy to locate | ⬜ Not started | — |
  | OBS 04 | E10.2 | Person A | tests | Add deterministic replay test using seed and action sequence | ENV 10 | 0.75h | replay of same seed and actions matches prior state sequence | ⬜ Not started | — |
  | OBS 05 | E10.2 | Person D | UI | Surface episode id and replay link in UI | API 05, UI 08 | 0.5h | user can easily capture or revisit a past episode | ⬜ Not started | — |
+ | OBS 06 | E10.1 | Person B | notebook | Log training run metadata including model, seed, scenario set, steps, evidence-pack version, and bounded-tool policy | TRN 06 | 0.5h | notebook exports metadata with each run for reproducibility including evidence-pack version and bounded-tool policy | ⬜ Not started | — |
  | OBS 07 | E10.1 | Person C | scripts | Add simple local script to run one episode and dump logs | ENV 06, OBS 01 | 0.5h | one command produces a complete local sample log | ⬜ Not started | — |
  | OBS 08 | E10.2 | Person D | storytelling | Create static replay screenshots or gifs for README and video | UI 08 | 0.5h | at least two crisp visual assets are ready for docs and demo | ⬜ Not started | — |
  | OBS 09 | E10.1 | Person C | `replicalab/utils/logging.py` | Extend episode summary schema with `judge_notes`, `agreement`, `invalid_action_count`, and `invalid_action_rate` for replay and evaluation consumers | OBS 01, JDG 11, ENV 11 | 0.5h | every completed episode log contains the audit payload plus demo and evaluation metrics needed by notebook, UI, and README | ⬜ Not started | — |
 

  | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | TST 01 | E11.1 | Person A | `tests/test_env.py` | Add reset returns valid observations test | ENV 02 | 0.5h | test confirms both roles receive valid structured observations | ✅ Completed | Person B (Ayush) |
+ | TST 02 | E11.1 | Person A | `tests/test_env.py` | Add valid action step test | ENV 03 to ENV 06 | 0.5h | valid action advances round and returns correct shape | ✅ Completed | Person B (Ayush) |
+ | TST 03 | E11.1 | Person A | `tests/test_env.py` | Add invalid action handling test | MOD 05, ENV 03 | 0.5h | invalid action yields structured error and environment survives | ✅ Completed | Person B (Ayush) |
+ | TST 04 | E11.1 | Person A | `tests/test_reward.py` | Add perfect protocol high reward test | JDG 04 | 0.5h | perfect protocol scores higher than baseline and broken protocol | ✅ Completed | Person B (Ayush) |
+ | TST 05 | E11.1 | Person A | `tests/test_reward.py` | Add zero dimension or penalty behavior test | JDG 04 | 0.5h | zero feasibility or timeout lowers reward as expected | ✅ Completed | Person B (Ayush) |
  | TST 06 | E11.1 | Person C | `tests/test_server.py` | Add health plus reset plus step endpoint tests | API 01 to API 03 | 0.75h | API tests pass locally | ⬜ Not started | — |
+ | TST 07 | E11.1 | Person C | `tests/test_server.py` | Add WebSocket connection and invalid payload tests | API 06 | 0.75h | WebSocket errors are graceful and session stays isolated | ✅ Completed | Person B (Ayush) |
  | TST 08 | E11.2 | Person D | manual checklist | Create demo smoke checklist for local and hosted builds | UI 12, API 10 | 0.5h | team can verify full demo in under five minutes | ⬜ Not started | — |
+ | TST 09 | E11.2 | Person B | notebook checklist | Create notebook smoke test for fresh runtime | TRN 12 | 0.5h | training notebook runs from top with minimal edits and the bounded-tool path works against frozen evidence packs | ⬜ Not started | — |
  | TST 10 | E11.2 | all | full run | Execute one integrated test pass before freeze | all prior TST tasks | 1h | environment, UI, Space, and notebook all pass their smoke tests the same day | ⬜ Not started | — |
  | TST 11 | E11.1 | Person C | `tests/test_server.py` and `tests/test_env.py` | Add contract tests for judge audit payloads and invalid action metrics in terminal responses and replay logs | API 18, OBS 09 | 0.75h | tests confirm terminal payloads and replay files expose audit notes, agreement, and invalid action metrics | ⬜ Not started | — |
  | TST 12 | E11.2 | Person D | manual checklist | Add fallback `/web` smoke step plus replay slider and before versus after toggle checks to demo checklist | API 19, UI 14, UI 15 | 0.5h | checklist verifies custom UI path and fallback UI path are both demo ready | ⬜ Not started | — |
 
939
  3. reward
940
  4. done
941
  5. final info including component scores
942
+ 6. API key or secret configuration for hosted-model access in both hosted and notebook environments
943
 
944
  ### Scenario to judge contract
945
  Every scenario must provide:
docs/ayush/task_breakdown.md CHANGED
@@ -9,39 +9,52 @@ No assumptions from other documents are used to reclassify blocked status.

## 1. Blocking Status

- `FND 08`, `FND 09`, `MOD 09`, `SCN 11`, and `AGT 01` are now complete.
+ `FND 08`, `FND 09`, `MOD 09`, `SCN 11`, `AGT 01`, `AGT 02`, `AGT 03`,
+ `AGT 04`, `AGT 05`, `AGT 06`, `AGT 07`, `AGT 08`, `AGT 11`, and `TRN 13`
+ are now complete.
The scenario prerequisite bundle (`SCN 01` to `SCN 10`) also exists in the
repo, so Ayush no longer waits on `SCN 09` to start prompt-layer work.

Ayush now has one fully unblocked task:

- 1. `AGT 03` -- highest leverage next task inside the Scientist chain
+ 1. `TRN 03` -- environment client wrapper for notebook rollouts (uses `replicalab/client.py` from TRN 13)

The prompt and Lab Manager workstream continues to assume a normalized scenario
pack below the stable outer contract, so Ayush-owned prompting should be
assembled from mapped scenario data rather than hard-coded to one domain.

+ Bounded-tool scope note:
+
+ 1. Ayush-owned prompt, training, and client tasks now assume bounded `search`,
+    `code_check`, and `image_inspection` capabilities.
+ 2. Training must still use frozen evidence packs and deterministic reward.
+ 3. Live web search is for validation or demo-time evidence only, not the core
+    reward loop.
+ 4. Audio remains out of scope.
+
---

## 2. Active Now

| ID | Task | Depends On | Why It Is Ready | Est |
|----|------|-----------|-----------------|-----|
- | AGT 03 | Parse plus retry for malformed output | MOD 09, AGT 02 | The parser and observation formatter are now both complete | 0.75h |
+ | TRN 03 | Env client wrapper in notebook | API 06, TRN 13 | `replicalab/client.py` is complete with dual-transport support; TRN 03 wraps it for notebook rollout use | 1h |

- **Total: 1 task, 0.75h**
+ **Total: 1 task, 1h**

---

- ## 3. Internal Ayush Chain After AGT 03
+ ## 3. Internal Ayush Chain After API 06

These are blocked only by earlier Ayush-owned work.

| ID | Task | Depends On | Blocked By | Est |
|----|------|-----------|-----------|-----|
- | AGT 08 | Prompt formatting and parse tests | AGT 01 to AGT 04 | Person B: AGT 03 | 0.75h |
+ | TRN 04 | Rollout collection loop with frozen evidence packs and bounded tool traces | TRN 03, AGT 01 | Person B: TRN 03 | 1h |
+ | TRN 05 | Connect rollouts to GRPO trainer | TRN 04 | Person B: TRN 04 | 1.25h |
+ | TRN 09 | Policy loading for trained checkpoint | TRN 05 | Person B: TRN 05 | 0.5h |

- **Total: 1 task, 0.75h**
+ **Total: 3 tasks, 2.75h**

---

@@ -49,46 +62,33 @@ These are blocked only by earlier Ayush-owned work.

| ID | Task | Depends On | Remaining External Deliverable | Est |
|----|------|-----------|-------------------------------|-----|
- | JDG 10 | Expose component metrics for training plots | JDG 05, JDG 07 | `JDG 05` from Kian and `JDG 07` from Max | 0.5h |
+ | AGT 10 | Write domain-neutral prompt text files for all 3 roles with bounded tool rules | AGT 01, AGT 07, JDG 06 | `JDG 06` from Kian | 0.75h |
+ | JDG 10 | Expose component metrics for training plots | JDG 05, JDG 07 | `JDG 07` from Max | 0.5h |

- **Total: 1 task, 0.5h**
+ **Total: 2 tasks, 1.25h**

### What to ask Kian for first

- 1. `JDG 05` and `JDG 06` -- unlock `JDG 10` and later `AGT 10`
+ 1. `JDG 06` -- unlocks `AGT 10`
2. `SCN 13` -- deepens the booking-conflict layer for the Lab Manager path
- 3. `ENV 01` -- makes the real environment path available beyond the stub server
-
- ---
-
- ## 5. Mixed Chain After AGT 05 and Judge Work
-
- These depend on both Ayush-owned work and remaining upstream work.
-
- | ID | Task | Depends On | Blocked By | Est |
- |----|------|-----------|-----------|-----|
- | AGT 10 | Write domain-neutral prompt text files for all 3 roles | AGT 01, AGT 07, JDG 06 | Person A: JDG 06 | 0.75h |
-
- **Total: 1 task, 0.75h**
+ 3. `ENV 10` and `JDG 08` -- strengthen the env or judge regression layer before training ramps

---

- ## 6. Blocked by Max (Person C)
+ ## 5. Blocked by Max (Person C)

- Cannot proceed until Max delivers the server and deployment pieces.
+ Cannot proceed until Max delivers the remaining server and deployment pieces.

| ID | Task | Depends On | Max Deliverable | Est |
|----|------|-----------|----------------|-----|
- | TRN 01 | Notebook skeleton | API 10 | Deployed HF Space | 0.5h |
- | TRN 03 | Env client wrapper in notebook | API 06 | WebSocket handler against the real env | 1h |
- | TRN 13 | `client.py` reusable module | API 06 | WebSocket handler against the real env | 1h |
+ | TRN 01 | Notebook skeleton | API 10 | Deployed HF Space or stable hosted env URL | 0.5h |

- **Total: 3 tasks, 2.5h**
+ **Total: 1 task, 0.5h**

### What to ask Max for first

- 1. `API 06` -- unblocks `TRN 03` and `TRN 13`
- 2. `API 10` -- unblocks `TRN 01`
+ 1. `API 10` -- unlocks `TRN 01`
+ 2. `JDG 07` -- unlocks `JDG 10`

---

@@ -99,19 +99,20 @@ are done.

| Order | ID | Task | Depends On | Est |
|-------|----|------|-----------|-----|
- | 1 | TRN 02 | Package install and model setup cell | TRN 01 | 0.75h |
- | 2 | TRN 14 | Select and document base model (notebook side) | TRN 01 | 0.5h |
- | 3 | TRN 04 | Rollout collection loop | TRN 03, AGT 01 | 1h |
- | 4 | TRN 05 | Connect rollouts to GRPO trainer | TRN 04 | 1.25h |
- | 5 | TRN 06 | Log episode metrics | JDG 10, TRN 04 | 0.75h |
- | 6 | TRN 07 | Plot reward curves | TRN 06 | 0.5h |
- | 7 | TRN 08 | Before vs after eval on fixed seeds | SCN 11, TRN 05 | 1h |
- | 8 | TRN 09 | Policy loading for trained checkpoint | TRN 05 | 0.5h |
- | 9 | TRN 10 | Export plots to outputs/plots | TRN 07 | 0.25h |
- | 10 | TRN 15 | Agreement and invalid action rate aggregation | TRN 06, TRN 08, OBS 09 | 0.5h |
- | 11 | OBS 06 | Log training run metadata | TRN 06 | 0.5h |
-
- **Total: 11 tasks, 7.5h**
+ | 1 | TRN 01 | Notebook skeleton | API 10 | 0.5h |
+ | 2 | TRN 02 | Package install and model setup cell | TRN 01 | 0.75h |
+ | 3 | TRN 14 | Select and document base model (notebook side) | TRN 01 | 0.5h |
+ | 4 | TRN 04 | Rollout collection loop with frozen evidence packs and bounded tool traces | TRN 03, AGT 01 | 1h |
+ | 5 | TRN 05 | Connect rollouts to GRPO trainer | TRN 04 | 1.25h |
+ | 6 | TRN 06 | Log episode metrics plus bounded tool metrics | JDG 10, TRN 04 | 0.75h |
+ | 7 | TRN 07 | Plot reward curves | TRN 06 | 0.5h |
+ | 8 | TRN 08 | Before vs after eval on fixed seeds and frozen evidence packs | SCN 11, TRN 05 | 1h |
+ | 9 | TRN 09 | Policy loading for trained checkpoint | TRN 05 | 0.5h |
+ | 10 | TRN 10 | Export plots to outputs/plots | TRN 07 | 0.25h |
+ | 11 | TRN 15 | Agreement, invalid action, and invalid bounded-tool rate aggregation | TRN 06, TRN 08, OBS 09 | 0.5h |
+ | 12 | OBS 06 | Log training run metadata | TRN 06 | 0.5h |
+
+ **Total: 12 tasks, 8h**

---

@@ -134,43 +135,46 @@ are done.
3. `MOD 09`
4. `SCN 11`
5. `AGT 01`
+ 6. `AGT 02`
+ 7. `AGT 03`
+ 8. `AGT 04`
+ 9. `AGT 05`
+ 10. `AGT 06`
+ 11. `AGT 07`
+ 12. `AGT 08`
+ 13. `AGT 11`
+ 14. `TRN 13`

### Phase 2: Active now

- 6. `AGT 03`
+ 15. `TRN 03`

- ### Phase 3: After AGT 03
+ ### Phase 3: After `API 10`

- 7. `AGT 08`
+ 16. `TRN 01`
+ 17. `TRN 02`
+ 18. `TRN 14`

### Phase 4: After judge work

- 8. `AGT 10`
- 9. `JDG 10`
-
- ### Phase 5: After Max lands `API 06` and `API 10`
-
- 10. `TRN 13`
- 11. `TRN 01`
- 12. `TRN 02`
- 13. `TRN 03`
- 14. `TRN 14`
+ 19. `AGT 10`
+ 20. `JDG 10`

- ### Phase 6: Training pipeline
+ ### Phase 5: Training pipeline

- 15. `TRN 04`
- 16. `TRN 05`
- 17. `TRN 06`
- 18. `TRN 07`
- 19. `TRN 08`
- 20. `TRN 09`
- 21. `TRN 10`
- 22. `TRN 15`
- 23. `OBS 06`
+ 21. `TRN 04`
+ 22. `TRN 05`
+ 23. `TRN 06`
+ 24. `TRN 07`
+ 25. `TRN 08`
+ 26. `TRN 09`
+ 27. `TRN 10`
+ 28. `TRN 15`
+ 29. `OBS 06`

### Phase 7: Final notebook validation

- 24. `TST 09`
+ 30. `TST 09`

---

@@ -178,14 +182,13 @@

| Category | Count | Hours |
|----------|-------|-------|
- | Active now | 1 | 0.75h |
- | Internal Ayush chain after AGT 03 | 1 | 0.75h |
- | Blocked by Kian or mixed A+B work | 1 | 0.5h |
- | Mixed chain after AGT 05 and judge work | 1 | 0.75h |
- | Blocked by Max | 3 | 2.5h |
- | Deep training chain | 11 | 7.5h |
+ | Active now | 1 | 1h |
+ | Internal Ayush chain after API 06 | 3 | 2.75h |
+ | Blocked by Kian or mixed A+B work | 2 | 1.25h |
+ | Blocked by Max | 1 | 0.5h |
+ | Remaining downstream training chain | 8 | 4.75h |
| Blocked by Kush | 1 | 0.5h |
- | **Total remaining** | **19** | **13.25h** |
+ | **Total remaining** | **16** | **10.75h** |

---
docs/ayush/task_list.md CHANGED
@@ -11,14 +11,17 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
- `SCN 11` is complete in `tests/fixtures/golden_scenarios.json`
- `AGT 01` is complete in `replicalab/agents/scientist_policy.py`
- `AGT 02` is complete in `replicalab/agents/scientist_policy.py`
+ - `AGT 03` is complete in `replicalab/agents/scientist_policy.py`
- `AGT 04` is complete in `replicalab/agents/scientist_policy.py`
- `AGT 05` is complete in `replicalab/agents/lab_manager_policy.py`
- `AGT 06` is complete in `replicalab/agents/lab_manager_policy.py`
- `AGT 07` is complete in `replicalab/agents/lab_manager_policy.py`
+ - `AGT 08` is complete in `replicalab/agents/scientist_policy.py`
- `AGT 11` is complete in `docs/agt11_scientist_model_selection.md`
- The scenario prerequisite bundle (`SCN 01` to `SCN 10`) is now present in the repo, so Ayush prompt work is backed by real normalized scenario packs instead of placeholders
- - The next fully unblocked Ayush task is `AGT 03`
- - `AGT 03` is now the highest-leverage next step because the formatter and parser are both in place, so the retry loop can complete the Scientist action path end-to-end
+ - `API 06` is now complete, so `TRN 03` and `TRN 13` were fully unblocked
+ - `TRN 13` is now complete in `replicalab/client.py`
+ - The next fully unblocked Ayush task is `TRN 03`
- `AGT 10` now waits only on `JDG 06`

---

@@ -43,9 +46,9 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
- [x] **AGT 05** | Implement deterministic feasibility checker over normalized constraints and resources (shared with Person A) | 1.25h | Depends: SCN 07, MOD 05 | Status: completed on 2026-03-08
- [x] **AGT 06** | Implement alternative suggestion logic from allowed substitutions and tradeoffs | 1h | Depends: AGT 05, SCN 08 | Status: completed on 2026-03-08
- [x] **AGT 07** | Add model-backed Lab Manager response synthesis from checker output | 0.75h | Depends: AGT 05 | Status: completed on 2026-03-08
+ - [x] **AGT 03** | Add parse plus retry strategy for malformed model output | 0.75h | Depends: MOD 09, AGT 02 | Status: completed on 2026-03-07
+ - [x] **AGT 08** | Add prompt formatting and parse tests | 0.75h | Depends: AGT 01 to AGT 04 | Status: completed on 2026-03-07
- [x] **AGT 11** | Select and document base model for Scientist training | 0.5h | Depends: AGT 01 | Status: completed on 2026-03-08
- - [ ] **AGT 03** | Add parse plus retry strategy for malformed model output | 0.75h | Depends: MOD 09, AGT 02 | Status: ready now
- - [ ] **AGT 08** | Add prompt formatting and parse tests | 0.75h | Depends: AGT 01 to AGT 04
- [ ] **AGT 10** | Write domain-neutral prompt text files for all three roles | 0.75h | Depends: AGT 01, AGT 07, JDG 06

---

@@ -68,7 +71,7 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
- [ ] **TRN 08** | Add before versus after evaluation on fixed seeds | 1h | Depends: SCN 11, TRN 05
- [ ] **TRN 09** | Add policy loading path for trained adapter | 0.5h | Depends: TRN 05
- [ ] **TRN 10** | Export plot image and sample logs to outputs/plots | 0.25h | Depends: TRN 07
- - [ ] **TRN 13** | Create reusable environment client module (client.py) | 1h | Depends: API 06
+ - [x] **TRN 13** | Create reusable environment client module (client.py) | 1h | Depends: API 06 | Status: completed on 2026-03-08
- [ ] **TRN 14** | Select and document base model (notebook side) | 0.5h | Depends: TRN 01 | Assumption: Qwen3-4B primary, Qwen3-8B H100-only stretch
- [ ] **TRN 15** | Add agreement rate and invalid action rate aggregation | 0.5h | Depends: TRN 06, TRN 08, OBS 09

@@ -97,6 +100,6 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
| Metric | Value |
|--------|-------|
| Total tasks | 29 |
- | Completed | 10 |
- | Remaining | 19 |
- | Total estimated hours | 21.5h |
+ | Completed | 13 |
+ | Remaining | 16 |
+ | Total estimated hours | 11.75h |
docs/changes.md CHANGED
@@ -30,4 +30,20 @@ Rules:
30
  | 2026-03-08 | Person B (Ayush) | SCN 01 to SCN 10 | Executed the full scenario-engine prerequisite bundle even though it was assigned to Person A and originally sequenced after `MOD 04` | `SCN 11` and `AGT 01` needed a real normalized scenario generator rather than another placeholder, and the Kian plus Ayush lanes are being covered together | The repo now has deterministic seeded scenario generation for mathematics, machine learning, and finance-trading planning, plus golden fixtures and seeded scenario tests; `SCN 11`, `AGT 01`, and the stub server scenario list are now backed by the same normalized scenario pack | `MOD 04` still needs to thread the normalized scenario pack through `EpisodeState` and replay models cleanly |
31
  | 2026-03-08 | Person B (Ayush) | Architecture roadmap | Shifted the planning docs from lab-first replication toward a normalized multi-domain scenario layer with mathematics and machine learning first, finance and trading planning third, and physics or biology later | The team wants the environment to stay domain-agnostic under a stable outer contract while keeping the reward deterministic and making the Lab Manager stronger for the hackathon story | The source-of-truth backlog, README, and Kian or Ayush planning docs now assume `scenario adapter -> normalized scenario pack -> observation mapper -> stable contracts`, plus a hybrid Lab Manager with deterministic feasibility grounding | `SCN 02`, `SCN 07`, `SCN 08`, `AGT 01`, `AGT 05`, `AGT 07`, and the judge wording must now be implemented to this architecture |
32
  | 2026-03-08 | Person B (Ayush) | FND 03 and FND 12 | Imported the frontend shell and Vite proxy config from Kush's branch even though both tasks are assigned to Max | The `ayush` integration branch only had the frontend scaffold, and the validated frontend from `origin/Kush` needed to exist on the integration branch for future UI and deployment work | `frontend/` now contains the full React plus Vite app, `frontend/vite.config.ts` is present with API and WebSocket proxy rules, and local validation passed with `npm --prefix frontend install` plus `npm --prefix frontend run build` | `FND 13` and `UI 01` are now unblocked; remaining UI tasks still need explicit review before being marked complete |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
33
 
 
30
  | 2026-03-08 | Person B (Ayush) | SCN 01 to SCN 10 | Executed the full scenario-engine prerequisite bundle even though it was assigned to Person A and originally sequenced after `MOD 04` | `SCN 11` and `AGT 01` needed a real normalized scenario generator rather than another placeholder, and the Kian plus Ayush lanes are being covered together | The repo now has deterministic seeded scenario generation for mathematics, machine learning, and finance-trading planning, plus golden fixtures and seeded scenario tests; `SCN 11`, `AGT 01`, and the stub server scenario list are now backed by the same normalized scenario pack | `MOD 04` still needs to thread the normalized scenario pack through `EpisodeState` and replay models cleanly |
31
  | 2026-03-08 | Person B (Ayush) | Architecture roadmap | Shifted the planning docs from lab-first replication toward a normalized multi-domain scenario layer with mathematics and machine learning first, finance and trading planning third, and physics or biology later | The team wants the environment to stay domain-agnostic under a stable outer contract while keeping the reward deterministic and making the Lab Manager stronger for the hackathon story | The source-of-truth backlog, README, and Kian or Ayush planning docs now assume `scenario adapter -> normalized scenario pack -> observation mapper -> stable contracts`, plus a hybrid Lab Manager with deterministic feasibility grounding | `SCN 02`, `SCN 07`, `SCN 08`, `AGT 01`, `AGT 05`, `AGT 07`, and the judge wording must now be implemented to this architecture |
32
  | 2026-03-08 | Person B (Ayush) | FND 03 and FND 12 | Imported the frontend shell and Vite proxy config from Kush's branch even though both tasks are assigned to Max | The `ayush` integration branch only had the frontend scaffold, and the validated frontend from `origin/Kush` needed to exist on the integration branch for future UI and deployment work | `frontend/` now contains the full React plus Vite app, `frontend/vite.config.ts` is present with API and WebSocket proxy rules, and local validation passed with `npm --prefix frontend install` plus `npm --prefix frontend run build` | `FND 13` and `UI 01` are now unblocked; remaining UI tasks still need explicit review before being marked complete |
33
+ | 2026-03-08 | Person B (Ayush) | Capability scope and backlog | Expanded the MVP from pure constrained negotiation to bounded evidence-backed research planning with scoped search, code-check, and image-inspection capability, while explicitly excluding audio and unrestricted live web in training | The team decided that research applicability requires richer capabilities, but the hackathon still needs a deterministic RL story with bounded tools and reproducible rewards | The source-of-truth backlog now treats richer capabilities as an additive layer below the frozen outer contract; completed schema and agent work stays valid, while pending prompt, judge, environment, API, and training tasks now absorb bounded tool and evidence-pack support | Keep live web mostly for demo or eval validation, and keep frozen evidence packs as the default training path |
34
+ | 2026-03-07 | Person B (Ayush) | AGT 03 | Backlog showed "Not started" but the implementation (parse-and-retry loop with telemetry) already existed from a prior commit | The code and 7 tests were committed earlier but the tracker was never updated | Synced both `ReplicaLab_Comprehensive_Task_Division.md` and `docs/completion.md` to reflect completed status | None |
35
+ | 2026-03-07 | Person B (Ayush) | AGT 08 | Expanded scope from test-only to tests plus a bounded-tool policy prompt patch in `build_scientist_system_prompt()` | The acceptance criteria required testing bounded-tool policy reminders, but no tool-policy text existed in the prompt yet; user directed adding the prompt text alongside the tests | Added policy block for `search_evidence`, `run_code_check`, and `inspect_image` to the system prompt; wrote 24 new tests covering parser, prompt, formatter, baseline, and bounded-tool policy; all 111 tests pass | None |
36
+ | 2026-03-08 | Person B (Ayush) | ENV 01 | Executed the task even though it was assigned to Person A | The real environment class was still missing, but the server now switches to `ReplicaLabEnv` on successful import, so a working drop-in module was needed before environment and API work could safely proceed | Added `replicalab/env/replicalab_env.py` and `replicalab/env/__init__.py` as a working drop-in replacement for the former in-server stub, verified direct `reset() -> step() -> state() -> close()` behavior, and confirmed the full test suite stays green at `111 passed` | `ENV 02` and `ENV 08` are now unblocked, and the server can instantiate the real env class instead of the fallback stub |
37
+ | 2026-03-08 | Person B (Ayush) | JDG 01, JDG 02, JDG 03 | Executed three scoring tasks assigned to Person A | The judge scoring chain was the next critical-path blocker: JDG 04 (total reward formula) depends on all three, and ENV 06 (reward integration) depends on JDG 05 which depends on JDG 04 | Added `replicalab/scoring/rigor.py` (weighted structural completeness, success criteria coverage, required element coverage), `replicalab/scoring/feasibility.py` (7-dimension partial-credit scorer wrapping AGT 05 feasibility checker), `replicalab/scoring/fidelity.py` (substitution-aware hidden-reference adherence scorer), shared `replicalab/utils/text.py` (token extraction and label normalization), `replicalab/scoring/__init__.py` (exports), and `tests/test_reward.py` (18 tests covering ordering, determinism, partial credit, domain range, and cross-scorer consistency); all 134 tests pass | JDG 04 is now unblocked; tracker docs were synced separately |
38
+ | 2026-03-08 | Person B (Ayush) | ENV 02, ENV 03, ENV 04, ENV 05, ENV 06, ENV 07, ENV 08, JDG 04, JDG 05, TST 01, TST 02, TST 03 | Executed the full environment chain and rubric tasks assigned to Person A | The environment needed real scenario wiring, validation, grounded Lab Manager responses, centralized termination, judge-computed rewards, deep state snapshots, and close lifecycle guards; the rubric needed the total reward formula and breakdown builder; and the test suite needed reset, step, and invalid-action coverage | Rewrote `replicalab/env/replicalab_env.py` (ENV 02-08: scenario-pack-backed observations, protocol validation, grounded LM pipeline, accept-or-max-rounds termination, real judge scoring via rubric, deep state copies, closed-env guard), created `replicalab/scoring/rubric.py` (JDG 04-05: `compute_total_reward` with `10 × r × f × fi + bonuses − penalties`, `build_reward_breakdown` composing all three sub-scores with efficiency bonus), updated `replicalab/scoring/__init__.py` exports, and created `tests/test_env.py` (TST 01-03: 32 tests covering reset, step, invalid action, state snapshot, close/reopen, and rubric); all 166 tests pass | JDG 06, JDG 08, ENV 10, ENV 11, TST 04, TST 05 are now unblocked; partial server tasks (API 02, 03, 06, 07) can now wire against the real env |
39
+ | 2026-03-07 | Person B (Ayush) | JDG 04, JDG 05, ENV 06 finalization | Refined the draft implementations to match final acceptance criteria | JDG 04 needed a zero-clamp floor and JDG 05 needed a named-penalty extension point for bounded-tool diagnostics; ENV 06 needed to distinguish timeout from no-agreement verdicts | `compute_total_reward` now clamps at 0.0; `build_reward_breakdown` accepts optional `penalties: dict[str, float]` for named penalty keys like `invalid_tool_use` and `unsupported_claim`; terminal-without-agreement path now returns `timeout` when max rounds reached vs `no_agreement` otherwise; added 8 new tests in `test_reward.py` and 4 new tests in `test_env.py`; 178 tests pass across the full suite | None |
40
+ | 2026-03-07 | Person B (Ayush) | API 03 | Completed the `POST /step` endpoint task assigned to Person C by fixing stale replay logging and adding endpoint tests | The `_build_episode_log()` helper still hardcoded stub audit notes, rebuilt `RewardBreakdown` from state, and used `accept`/`revise` instead of the real `timeout`/`no_agreement` verdicts; both REST and WebSocket terminal paths used the stale helper; and no `/step` endpoint tests existed | Updated `_build_episode_log()` to accept the terminal `StepResult` and use its real `reward_breakdown`, `judge_notes`, and `verdict`; updated both REST `/step` and WebSocket step completion paths to pass the result; fixed `_StubEnv` reference to removed helper; added five endpoint tests covering happy path, invalid session 404, terminal real reward breakdown, semantic invalid action as 200 with `info.error`, and replay with real judge data; all 183 tests pass | API 14 and API 18 are now closer to completion; TST 06 is partially covered by the new tests |
41
+ | 2026-03-07 | Person B (Ayush) | API 06 and TST 07 | Executed the WebSocket session handler task and its test task even though both were assigned to Person C | The WebSocket handler already existed in `server/app.py` but had no test coverage, and completing `API 06` was needed to unblock `TRN 03` and `TRN 13` in Person B's own lane | Added 12 WebSocket tests covering connectivity, message handling, error paths, session isolation, semantic-vs-transport error distinction, timeout verdict with real-env integration, and terminal episode replay persistence via `GET /replay/{episode_id}`; all 195 tests pass; `TRN 03` and `TRN 13` are now unblocked for Person B | `TRN 03` and `TRN 13` are now the next Person B tasks |
42
+ | 2026-03-08 | Person B (Ayush) | API 13 | Executed the task even though it was assigned to Person C | The CORS middleware already existed in `server/app.py`, but the task was still partial because frontend-origin verification had not been made explicit | Added three server tests covering localhost Vite preflight, Hugging Face Space origin preflight, and disallowed-origin rejection; `API 13` is now recorded complete in the source of truth and owner trackers | `API 02`, `API 04`, `API 07`, `API 08`, `API 14`, and `OBS 02` remain in Max's active lane |
43
+ | 2026-03-08 | Person B (Ayush) | API 04 | Executed the task even though it was assigned to Person C | The `/scenarios` endpoint and its focused tests already met the acceptance criteria, but the task was still marked partial in the trackers | Recorded `API 04` complete in the source of truth and owner trackers based on the existing typed response model, normalized family list, and five dedicated endpoint tests | `API 07`, `API 08`, `API 14`, and `OBS 02` remain in Max's active lane |
44
+ | 2026-03-08 | Person B (Ayush) | API 02 | Completed the `POST /reset` endpoint verification and test closure even though the task was assigned to Person C | The endpoint already worked against the real env via `_make_env()` but had no dedicated test coverage and was still marked partial in the tracker | Added seven dedicated `/reset` endpoint tests covering response shape, both-role observation, explicit session_id reuse with prior-env close, default params, all scenario and difficulty combos, and seed determinism; all 202 tests pass; `API 14` and `UI 06` are now closer to completion | None |
+ | 2026-03-08 | Person B (Ayush) | TRN 13 | Implemented `replicalab/client.py` as specified in the task backlog | `API 06` was complete and `TRN 13` was the next unblocked Person B task | Created `ReplicaLabClient` with dual-transport support (REST via `httpx`, WebSocket via `websocket-client`), unified sync interface (`connect`, `reset`, `step`, `state`, `close`), context manager, internal session tracking, typed Pydantic returns, and 24 tests covering both transports; all 231 tests pass | `TRN 03` is now the next unblocked Person B task |
+ | 2026-03-08 | Person B (Ayush) | API 07 | Completed the WebSocket idle-timeout and graceful-disconnect verification even though the task was assigned to Person C | The idle-timeout logic and `finally: env.close()` path already existed in `server/app.py`, but the task was still partial because resource-cleanup verification had not been made explicit | Added two focused WebSocket tests covering idle timeout close code `1000` and exactly-once `env.close()` on disconnect; `API 07` is now recorded complete in the source of truth and owner trackers | `API 08`, `API 14`, and `OBS 02` remain in Max's active lane |
+ | 2026-03-08 | Person B (Ayush) | API 08 | Completed the Docker build and run verification even though the task was assigned to Person C | The Dockerfile existed but had never been verified end to end; editable install failed inside Docker, and `httpx` plus `websocket-client` were missing from `server/requirements.txt` | Fixed `pip install -e .` to `pip install .` in both `server/Dockerfile` and root `Dockerfile`; added `httpx` and `websocket-client` to `server/requirements.txt`; rebuilt without cache; verified container starts with `"env":"real"` and all four endpoints (`/health`, `/scenarios`, `/reset`, `/step`) respond correctly; added verified endpoint commands to `docs/max/deployment.md` | `API 09` and `API 16` are now unblocked |
+ | 2026-03-08 | Person B (Ayush) | Recovery sync, API 09, API 15, TST 04, TST 05 | Recovered the lost env, server, client, and test bundle from unreachable git objects and re-synced the deployment and testing trackers to the validated repo state | The branch had rolled back to `5538ba0`, which left the working code, deployment metadata, and tracker files out of sync even though the recovered code passes 231 tests, Docker validation, and OpenEnv validation | Restored the missing runtime files, revalidated the real env and Docker path, recorded the HF Space metadata tasks (`API 09`, `API 15`) as complete, and closed the two reward-regression tests (`TST 04`, `TST 05`) that are already covered in `tests/test_reward.py` | Live HF Space bring-up remains `API 10` |
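The idle-timeout and exactly-once cleanup behavior verified in the `API 07` entries above follows a common asyncio pattern; a minimal sketch (the names, timeout value, and helper bodies here are illustrative, not the server's actual code):

```python
import asyncio

IDLE_TIMEOUT_S = 0.05  # illustrative; the real server uses a longer window


async def run_session(receive, close_env):
    """Close the env exactly once, whether the client idles out or disconnects."""
    try:
        while True:
            try:
                msg = await asyncio.wait_for(receive(), timeout=IDLE_TIMEOUT_S)
            except asyncio.TimeoutError:
                # The server would close the WebSocket with code 1000 here.
                return "idle_timeout"
            if msg is None:  # client disconnected
                return "disconnect"
    finally:
        close_env()  # runs on every exit path, exactly once


async def _silent_client():
    # A client that never sends anything, to trip the idle timeout.
    await asyncio.sleep(3600)


close_calls = []
verdict = asyncio.run(run_session(_silent_client, lambda: close_calls.append(1)))
```

Running `env.close()` from a `finally` block is what makes the "exactly once" test in the row above meaningful: timeout and disconnect share a single cleanup path.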
 
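The `TRN 13` client described above hides two transports behind one sync surface; a rough sketch of that shape (the stub transport and all method bodies here are invented for illustration, not the real `ReplicaLabClient` internals):

```python
class _StubTransport:
    """Stand-in for the REST (httpx) or WebSocket (websocket-client) layer."""

    def __init__(self):
        self.closed = False

    def reset(self, **params):
        return {"episode_id": "ep-1", "observation": {"round": 0, **params}}

    def step(self, action):
        return {"reward": 0.0, "done": action.get("type") == "accept"}

    def close(self):
        self.closed = True


class ClientSketch:
    """Mirrors the unified connect/reset/step/close surface as a context manager."""

    def __init__(self, transport):
        self._transport = transport

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._transport.close()
        return False

    def reset(self, **params):
        return self._transport.reset(**params)

    def step(self, action):
        return self._transport.step(action)


transport = _StubTransport()
with ClientSketch(transport) as client:
    obs = client.reset(scenario="math_replication", difficulty="easy")
    result = client.step({"type": "accept"})
```

The context manager guarantees `close()` on exit, which is the same lifecycle guarantee the recovery entry relies on when sessions are replayed.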
docs/completion.md CHANGED
@@ -20,30 +20,32 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
  | Metric | Value |
  |--------|-------|
  | Total tasks | 152 |
- | Completed | 38 |
- | Partial / active | 10 |
- | Remaining | 104 |
- | **Completion rate** | **25.00%** |
 
  ### Completion by Person
 
  | Person | Assigned | Completed (own) | Completed (by others) | Remaining | Rate |
  |--------|----------|----------------|----------------------|-----------|------|
- | Kian (Person A) | 49 (47 solo + 2 shared with B) | 1 shared sign-off (`FND 08`) | 20 (`FND 04`, `FND 09`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`, `MOD 11`, `MOD 12`, `SCN 01` to `SCN 10`, `AGT 05` done by Person B) | 28 | 42.86% |
- | Person B (Ayush) | 29 (27 solo + 2 shared with A) | 10 (`FND 08`, `MOD 09`, `SCN 11`, `AGT 01`, `AGT 02`, `AGT 04`, `AGT 05`, `AGT 06`, `AGT 07`, `AGT 11`) | 0 | 19 | 34.48% |
- | Max (Person C) | 41 | 1 (`FND 11`) | 7 (`FND 01`, `FND 02`, `FND 03`, `FND 05`, `FND 07`, `FND 10`, `FND 12` done by others) | 33 | 19.51% |
  | Kush (Person D) | 32 | 0 | 1 (`FND 06` done by Person B) | 31 | 3.13% |
  | All (shared) | 3 | 2 (`FND 08`, `AGT 05`) | 0 | 1 | 66.67% |
 
  Note: Person B (Ayush) has completed two shared tasks in their own lane
- (`FND 08`, `AGT 05`) plus eight solo tasks in their own lane (`MOD 09`,
- `SCN 11`, `AGT 01`, `AGT 02`, `AGT 04`, `AGT 06`, `AGT 07`, `AGT 11`), and has also executed twenty-five tasks outside their assigned
- ownership (`FND 01`, `FND 02`, `FND 04`, `FND 05`, `FND 06`, `FND 07`,
  `FND 09`, `FND 10`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`,
- `MOD 11`, `MOD 12`, `SCN 01` to `SCN 10`) to keep the Kian, Max, and Kush
- dependency chain moving. Ayush now has one fully unblocked implementation
- task available: `AGT 03`, with `AGT 10` reduced to a single remaining
- external dependency on `JDG 06`.
  ---
 
@@ -51,14 +53,7 @@ external dependency on `JDG 06`.

  | ID | Assigned To | Current Status | Remaining Acceptance Item |
  |----|-------------|----------------|---------------------------|
- | API 01 | Max (Person C) | FastAPI app shell and `/health` endpoint work locally against the stub env | Real env dependency and task-owner sign-off |
- | API 02 | Max (Person C) | `/reset` works locally against the stub env and now seeds normalized math / ML / finance scenarios through the shared generator | Real env reset dependency and task-owner sign-off |
- | API 03 | Max (Person C) | `/step` works locally against the stub env | Real env step dependency and task-owner sign-off |
- | API 04 | Max (Person C) | `/scenarios` returns the normalized scenario-family list from the shared generator | Real env exposure and task-owner sign-off |
- | API 06 | Max (Person C) | WebSocket reset, ping, and step work locally against the stub env, including normalized scenario-family resets | Real env integration and task-owner sign-off |
- | API 07 | Max (Person C) | Idle timeout and cleanup logic exist in the WebSocket path | Real env disconnect cleanup verification |
- | API 08 | Max (Person C) | `server/Dockerfile` exists | Local Docker build and run verification |
- | API 13 | Max (Person C) | CORS middleware exists for dev and hosted origins | Frontend integration verification |
  | API 14 | Max (Person C) | REST session isolation exists in the server stub path | Concurrent-session verification against the real env |
  | OBS 02 | Max (Person C) | Structured local logging exists in `server/app.py` | Logging behavior needs real-env usage confirmation |
 
@@ -95,6 +90,34 @@ external dependency on `JDG 06`.
  | SCN 08 | E03 | Person A | Implement hidden reference spec and allowed substitutions per template | `replicalab/scenarios/templates.py`, `tests/test_scenarios.py` | 2026-03-08 | Added per-template hidden reference specs and allowed substitutions so scoring and negotiation can distinguish fixed versus flexible elements deterministically. | Hidden reference clearly marks what is fixed versus flexible for deterministic scoring | Yes - verified with `python -m pytest tests/test_scenarios.py` |
  | SCN 09 | E03 | Person A | Implement `generate_scenario(seed, template, difficulty)` | `replicalab/scenarios/templates.py`, `server/app.py`, `tests/test_scenarios.py` | 2026-03-08 | Added deterministic full-scenario generation and wired the stub server to use the normalized scenario families instead of the earlier hard-coded lab-only placeholder list. | Function returns a full scenario with deterministic content | Yes - verified with `python -m pytest tests/test_scenarios.py` and a `_StubEnv.reset(...)` smoke test |
  | SCN 10 | E03 | Person A | Add seeded generation tests and consistency tests | `tests/test_scenarios.py` | 2026-03-08 | Added seeded determinism, variation, difficulty, consistency, and family-list tests for the normalized scenario engine. | Same seed plus template returns the same scenario and different seeds vary | Yes - verified with `python -m pytest tests/test_scenarios.py` |
  ### Person B (Ayush) - Completed own tasks
 
@@ -104,11 +127,14 @@ external dependency on `JDG 06`.
  | SCN 11 | E03 | Create hand checked golden scenarios for prompt testing | `tests/fixtures/golden_scenarios.json`, `tests/test_scenarios.py` | 2026-03-08 | Added three deterministic golden scenarios for math, ML, and finance prompt checks plus fixture-validation tests. | Three fixed scenarios are available for deterministic manual testing | Yes - verified with `python -m pytest tests/test_scenarios.py` |
  | AGT 01 | E04 | Draft domain-neutral system prompt for Scientist role from normalized scenario data | `replicalab/agents/scientist_policy.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `build_scientist_system_prompt(...)` to render role guidance, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON contract from normalized scenario data. | Prompt clearly explains role, mapped constraints, and JSON output contract | Yes - verified with `python -m pytest tests/test_scientist_policy.py` and a direct prompt-build smoke check |
  | AGT 02 | E04 | Build observation to prompt formatting helper from normalized scenario-derived observations | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `format_scientist_observation(...)` to render round status, paper context, conversation history, current protocol, and the next-action instruction in a fixed deterministic order, and exported it through the agent package. | Formatted prompt includes task info, history, and action schema consistently | Yes - verified with `python -m pytest tests/test_scientist_policy.py` |
- | AGT 04 | E04 | Build baseline heuristic Scientist for non trained smoke tests | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `build_baseline_scientist_action(...)`, a deterministic non-LLM Scientist policy that proposes a protocol on the first turn, revises only when the latest Lab Manager feedback contains an obvious blocker, and otherwise accepts the current protocol so smoke episodes can finish cleanly. | Baseline can complete episodes without crashing | Yes - verified with `python -m pytest tests/test_scientist_policy.py` including a stub-env episode smoke test |
  | AGT 05 | E04 | Implement deterministic feasibility checker over normalized constraints and resources | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added a deterministic Lab Manager feasibility checker with a typed `FeasibilityCheckResult`, explicit per-dimension protocol, budget, equipment, reagents, schedule, staff, and policy checks, substitution reporting, and stable summary output. | Checker returns clear pass or fail per constraint dimension | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py tests/test_validation.py tests/test_scientist_policy.py` |
  | AGT 06 | E04 | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added deterministic alternative-suggestion logic that applies substitutions, duration clamps, and sample-size reductions in fixed order, re-runs feasibility after the revision, and returns a typed `AlternativeSuggestion` with applied changes, remaining failures, and pre or post feasibility checks. | Lab Manager can suggest at least one sensible revision when the initial plan fails | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py` |
  | AGT 07 | E04 | Add grounded Lab Manager response synthesis from feasibility results and suggested revisions | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `server/app.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added `compose_lab_manager_response(...)`, a deterministic outward-action composer that converts feasibility plus alternative-suggestion results into a typed `LabManagerAction` with stable flags, readable explanations, and optional injected explanation rendering, then wired the stub server to log those grounded responses instead of placeholder text. | Output is readable, grounded in checker results, and maps cleanly to underlying checks | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py` and a stub-env step smoke check |
  | AGT 11 | E04 | Select and document base model for Scientist training | `docs/agt11_scientist_model_selection.md`, `README.md` | 2026-03-08 | Recorded `Qwen3-4B` as the primary Scientist training model with `Qwen3-8B` as the H100-only stretch fallback, and surfaced the decision in the README so the training path uses one canonical model choice. | Decision is recorded and all team members know which model will be fine tuned | Yes - verified by the decision record and README update |
  ### Kush (Person D) - Completed on behalf of others
 
@@ -179,6 +205,28 @@ external dependency on `JDG 06`.
  | AGT 06 | No new formal dependency edge by itself, but `AGT 07` now has deterministic revision content to narrate and compare against |
  | AGT 07 | `AGT 10` now only waits on `JDG 06`, and the stub server now emits grounded Lab Manager responses instead of placeholder review text |
  | AGT 11 | No new formal dependency edge by itself, but the Scientist training model choice is now fixed across repo docs |
  ### Current Unblocked and Active Tasks
 
@@ -186,15 +234,20 @@ external dependency on `JDG 06`.
  |----|-------|------|-------------|
  | FND 13 | Kush (Person D) | Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles | FND 03 |
  | UI 01 | Kush (Person D) | Create application shell with three panel layout | FND 03 |
- | AGT 03 | Person B (Ayush) | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 |
  | MOD 06 | Kian (Person A) | Add semantic validators for impossible plans such as zero sample size with positive controls | MOD 05 |
  | MOD 07 | Max (Person C) | Add state serialization helper for replay logs | MOD 04 |
- | JDG 01 | Kian (Person A) | Implement rigor or objective-validity score for plan completeness, required checks, method quality, and justification | SCN 08 |
- | JDG 02 | Kian (Person A) | Implement feasibility score for budget, resources, time, staffing, compute, and bookings | SCN 07, AGT 05 |
- | JDG 03 | Kian (Person A) | Implement fidelity score against hidden reference spec, required steps, and allowed substitutions | SCN 08 |
  | SCN 13 | Kian (Person A) | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time slot conflicts and duration | SCN 07 |
- | ENV 01 | Kian (Person A) | Create `ReplicaLabEnv` class skeleton | MOD 04, SCN 09 |
  | DOC 01 | Kush (Person D) | Write hook, problem statement, and one line product summary | FND 06 |
  ---
 
@@ -205,12 +258,12 @@ external dependency on `JDG 06`.
  | E01. Foundations and repository setup | 13 | 12 | 92.31% |
  | E02. Domain models, validation, state contracts | 12 | 8 | 66.67% |
  | E03. Scenario engine and constraint generation | 13 | 11 | 84.62% |
- | E04. Scientist agent and Lab Manager policy | 11 | 7 | 63.64% |
- | E05. Judge engine and reward logic | 11 | 0 | 0% |
- | E06. OpenEnv environment implementation | 11 | 0 | 0% |
- | E07. API, server, Docker, deployment | 19 | 0 | 0% |
- | E08. RL training pipeline and evaluation | 15 | 0 | 0% |
  | E09. Frontend, UX, replay, demo views | 15 | 0 | 0% |
  | E10. Logging, replay, and observability | 9 | 0 | 0% |
- | E11. Testing and quality gates | 12 | 0 | 0% |
  | E12. README, demo video, submission packaging | 11 | 0 | 0% |
 
  | Metric | Value |
  |--------|-------|
  | Total tasks | 152 |
+ | Completed | 67 |
+ | Partial / active | 3 |
+ | Remaining | 82 |
+ | **Completion rate** | **44.08%** |
 
  ### Completion by Person
 
  | Person | Assigned | Completed (own) | Completed (by others) | Remaining | Rate |
  |--------|----------|----------------|----------------------|-----------|------|
+ | Kian (Person A) | 49 (47 solo + 2 shared with B) | 1 shared sign-off (`FND 08`) | 38 (`FND 04`, `FND 09`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`, `MOD 11`, `MOD 12`, `SCN 01` to `SCN 10`, `AGT 05`, `ENV 01` to `ENV 08`, `JDG 01` to `JDG 05`, `TST 01` to `TST 05` done by Person B) | 10 | 79.59% |
+ | Person B (Ayush) | 29 (27 solo + 2 shared with A) | 13 (`FND 08`, `MOD 09`, `SCN 11`, `AGT 01`, `AGT 02`, `AGT 03`, `AGT 04`, `AGT 05`, `AGT 06`, `AGT 07`, `AGT 08`, `AGT 11`, `TRN 13`) | 0 | 16 | 44.83% |
+ | Max (Person C) | 41 | 1 (`FND 11`) | 17 (`FND 01`, `FND 02`, `FND 03`, `FND 05`, `FND 07`, `FND 10`, `FND 12` done by others, `API 02`, `API 03`, `API 04`, `API 06`, `API 07`, `API 08`, `API 09`, `API 13`, `API 15`, `TST 07` done by Person B) | 23 | 43.90% |
  | Kush (Person D) | 32 | 0 | 1 (`FND 06` done by Person B) | 31 | 3.13% |
  | All (shared) | 3 | 2 (`FND 08`, `AGT 05`) | 0 | 1 | 66.67% |
 
  Note: Person B (Ayush) has completed two shared tasks in their own lane
+ (`FND 08`, `AGT 05`) plus eleven solo tasks in their own lane (`MOD 09`,
+ `SCN 11`, `AGT 01`, `AGT 02`, `AGT 03`, `AGT 04`, `AGT 06`, `AGT 07`,
+ `AGT 08`, `AGT 11`, `TRN 13`), and has also executed a large cross-owner
+ bundle (`FND 01`, `FND 02`, `FND 04`, `FND 05`, `FND 06`, `FND 07`,
  `FND 09`, `FND 10`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`,
+ `MOD 11`, `MOD 12`, `SCN 01` to `SCN 10`, `ENV 01` to `ENV 08`, `JDG 01`
+ to `JDG 05`, `TST 01` to `TST 05`, `API 02`, `API 03`, `API 04`, `API 06`,
+ `API 07`, `API 08`, `API 09`, `API 13`, `API 15`, `TST 07`) to keep the
+ Kian, Max, and Kush dependency chain moving.
+ `TRN 13` is complete; `TRN 03` is the next unblocked task for Person B.
 
  ---
  | ID | Assigned To | Current Status | Remaining Acceptance Item |
  |----|-------------|----------------|---------------------------|
+ | API 01 | Max (Person C) | FastAPI app shell and `/health` endpoint work locally against the real env | Task-owner sign-off and final deployment-path polish |
  | API 14 | Max (Person C) | REST session isolation exists in the server stub path | Concurrent-session verification against the real env |
  | OBS 02 | Max (Person C) | Structured local logging exists in `server/app.py` | Logging behavior needs real-env usage confirmation |
  | SCN 08 | E03 | Person A | Implement hidden reference spec and allowed substitutions per template | `replicalab/scenarios/templates.py`, `tests/test_scenarios.py` | 2026-03-08 | Added per-template hidden reference specs and allowed substitutions so scoring and negotiation can distinguish fixed versus flexible elements deterministically. | Hidden reference clearly marks what is fixed versus flexible for deterministic scoring | Yes - verified with `python -m pytest tests/test_scenarios.py` |
  | SCN 09 | E03 | Person A | Implement `generate_scenario(seed, template, difficulty)` | `replicalab/scenarios/templates.py`, `server/app.py`, `tests/test_scenarios.py` | 2026-03-08 | Added deterministic full-scenario generation and wired the stub server to use the normalized scenario families instead of the earlier hard-coded lab-only placeholder list. | Function returns a full scenario with deterministic content | Yes - verified with `python -m pytest tests/test_scenarios.py` and a `_StubEnv.reset(...)` smoke test |
  | SCN 10 | E03 | Person A | Add seeded generation tests and consistency tests | `tests/test_scenarios.py` | 2026-03-08 | Added seeded determinism, variation, difficulty, consistency, and family-list tests for the normalized scenario engine. | Same seed plus template returns the same scenario and different seeds vary | Yes - verified with `python -m pytest tests/test_scenarios.py` |
+ | ENV 01 | E06 | Person A | Create `ReplicaLabEnv` class skeleton | `replicalab/env/replicalab_env.py`, `replicalab/env/__init__.py` | 2026-03-08 | Added a real `ReplicaLabEnv` module as a drop-in replacement for the former in-server stub, ported the working stub behavior into the environment package, wired scenario-pack-backed reset or step or state or close methods with follow-on `TODO(ENV XX)` markers, and removed the old stub-only marker from `StepInfo` payloads. | Environment class imports and instantiates without runtime errors | Yes - verified with a direct `ReplicaLabEnv.reset(...) -> step(...) -> state() -> close()` smoke run and `python -m pytest` (`111 passed`) |
+ | JDG 01 | E05 | Person A | Implement rigor or objective-validity score | `replicalab/scoring/rigor.py`, `replicalab/utils/text.py`, `tests/test_reward.py` | 2026-03-08 | Added `score_rigor(protocol, scenario)` with weighted sub-scores for structural completeness (0.30), success criteria coverage (0.40), and required element coverage (0.30). Uses shared `element_tokens` from `replicalab/utils/text.py`. Five focused tests in `test_reward.py` cover quality ordering, determinism, controls impact, rationale length, and all-domain range validation. | Score is between 0 and 1, matches rubric examples, and rewards correct evidence-backed planning | Yes - verified with `python -m pytest tests/test_reward.py` (18 tests pass) |
+ | JDG 02 | E05 | Person A | Implement feasibility score | `replicalab/scoring/feasibility.py`, `tests/test_reward.py` | 2026-03-08 | Added `score_feasibility(protocol, scenario, check=None)` that derives a continuous [0,1] signal from `FeasibilityCheckResult` (AGT 05). Seven dimensions weighted equally (1/7) with partial credit for budget, equipment, reagents, and staff. Accepts optional pre-computed check to avoid redundant work. Six focused tests cover viable protocol, infeasible ordering, pre-computed check equivalence, determinism, partial credit, and all-domain range. | Score is between 0 and 1 and matches normalized constraint logic | Yes - verified with `python -m pytest tests/test_reward.py` (18 tests pass) |
+ | JDG 03 | E05 | Person A | Implement fidelity score | `replicalab/scoring/fidelity.py`, `tests/test_reward.py` | 2026-03-08 | Added `score_fidelity(protocol, scenario)` with substitution-aware scoring: required element coverage (0.50, direct match=1.0, substitution=0.7), flexible element alignment (0.20, bonus only), target metric alignment (0.20), and technique appropriateness (0.10). Five focused tests cover aligned vs misaligned ordering, determinism, substitution partial credit, target metric impact, and all-domain range. | Score is between 0 and 1 and matches rubric examples for plan and evidence alignment | Yes - verified with `python -m pytest tests/test_reward.py` (18 tests pass) |
+ | JDG 04 | E05 | Person A | Implement total reward formula | `replicalab/scoring/rubric.py`, `tests/test_reward.py` | 2026-03-07 | `compute_total_reward(breakdown)` implements `10 × rigor × feasibility × fidelity + bonuses − penalties` with `max(0.0, ...)` floor clamp. Eight new tests in `test_reward.py` verify perfect-vs-broken ordering, zero-feasibility behavior, efficiency bonus ordering, exact penalty subtraction, zero-clamp floor, determinism, external penalties injection, and default-empty penalties. Seven existing rubric tests in `test_env.py` also cover the formula. | Total reward formula matches agreed math, clamps at zero, and returns consistent output for plan quality and bounded tool behavior | Yes - verified with `python -m pytest tests/test_reward.py` (26 tests pass) and `python -m pytest tests/test_env.py` (36 tests pass) |
+ | JDG 05 | E05 | Person A | Build reward breakdown object | `replicalab/scoring/rubric.py`, `replicalab/scoring/__init__.py`, `tests/test_reward.py` | 2026-03-07 | `build_reward_breakdown(...)` accepts an optional `penalties: dict[str, float]` parameter for named penalty keys (e.g. `invalid_tool_use`, `unsupported_claim`) from bounded-tool diagnostics without reopening the model contract. Returns a typed `RewardBreakdown` with rigor, feasibility, fidelity, efficiency_bonus, communication_bonus, and penalties dict. Exported through `replicalab.scoring`. | Breakdown includes rigor, feasibility, fidelity, bonuses, penalties, and bounded tool diagnostics extension point | Yes - verified with `python -m pytest tests/test_reward.py` (26 tests pass) and `python -m pytest tests/test_env.py` (36 tests pass) |
+ | ENV 02 | E06 | Person A | Implement real reset wiring | `replicalab/env/replicalab_env.py` | 2026-03-08 | `_make_observation()` now uses the scenario pack as source of truth for booked/out-of-stock/safety data instead of empty placeholders. Eight reset tests verify both roles populated, booked/out-of-stock preserved, all templates and difficulties. | Reset returns initial observations with full scenario data | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | ENV 03 | E06 | Person A | Implement Scientist turn with validation | `replicalab/env/replicalab_env.py` | 2026-03-08 | Added `_validate_scientist_action()` that runs `validate_protocol()` on proposals and returns structured error strings without crashing the env. Invalid actions don't advance the round. | Valid action updates state, invalid action returns structured error | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | ENV 04 | E06 | Person A | Implement Lab Manager response step | `replicalab/env/replicalab_env.py` | 2026-03-08 | `_lab_manager_action()` uses the full grounded pipeline: `check_feasibility()` → `suggest_alternative()` → `compose_lab_manager_response()`. | Lab Manager response is grounded in feasibility check results | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | ENV 05 | E06 | Person A | Centralize termination logic | `replicalab/env/replicalab_env.py` | 2026-03-08 | Added `_check_termination()`: Scientist accept with existing protocol OR max_rounds. Lab Manager accept does NOT auto-terminate. | Episode terminates on agreement or round limit | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | ENV 06 | E06 | Person A | Wire real judge scoring | `replicalab/env/replicalab_env.py`, `tests/test_env.py` | 2026-03-07 | Terminal accept steps call `build_reward_breakdown()` and `compute_total_reward()` with real rigor/feasibility/fidelity scores stored in `EpisodeState`. Terminal-without-agreement path now distinguishes `timeout` (max rounds) from `no_agreement` verdict. Four new tests in `TestEnvReward` verify agreement-terminal breakdown/notes/verdict, no-agreement determinism, timeout verdict, and state-stored component scores. | Final step returns total reward, breakdown info, and deterministic penalties or bonuses; verdict distinguishes timeout from no_agreement | Yes - verified with `python -m pytest tests/test_env.py` (36 tests pass) and `python -m pytest` (178 tests pass) |
+ | ENV 07 | E06 | Person A | Implement state() deep snapshot | `replicalab/env/replicalab_env.py` | 2026-03-08 | `state()` now returns `self._state.model_copy(deep=True)` so callers get an independent snapshot. Two tests verify mutation isolation. | State snapshot is independent of env internals | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | ENV 08 | E06 | Person A | Implement close() with lifecycle guard | `replicalab/env/replicalab_env.py` | 2026-03-08 | Added `_closed` flag, idempotent `close()`, `_ensure_open()` guard on `step()`, and `reset()` reopens a closed env. Three tests verify idempotency, step-after-close raises, and reset-reopens. | Close frees resources and does not throw; step after close raises | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | TST 01 | E11 | Person A | Add reset returns valid observations test | `tests/test_env.py` | 2026-03-08 | Eight tests in `TestReset` class covering both roles populated, scientist fields, lab manager fields, booked/out-of-stock preservation, state round zero, episode ID, clearing previous episode, and all templates/difficulties. | Test confirms both roles receive valid structured observations | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | TST 02 | E11 | Person A | Add valid action step test | `tests/test_env.py` | 2026-03-08 | Eight tests in `TestStep` class covering round advancement, observation shape, conversation history, accept termination, real reward scores, max round termination, step info fields, and full propose-then-accept episode. | Valid action advances round and returns correct shape | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | TST 03 | E11 | Person A | Add invalid action handling test | `tests/test_env.py` | 2026-03-08 | Four tests in `TestInvalidAction` class covering error string on invalid duration, env survival after error, no round advancement on invalid action, and request_info always passes. | Invalid action yields structured error and env survives | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
+ | TST 04 | E11 | Person A | Add perfect protocol high reward test | `tests/test_reward.py` | 2026-03-08 | Added reward-regression coverage proving a fully aligned protocol scores higher than a broken baseline and stays ordered consistently across reruns. | Perfect protocol scores higher than baseline and broken protocol | Yes - verified with `python -m pytest tests/test_reward.py` (26 tests pass) |
+ | TST 05 | E11 | Person A | Add zero dimension or penalty behavior test | `tests/test_reward.py` | 2026-03-08 | Added reward-regression coverage for zero-feasibility collapse, exact penalty subtraction, and zero-floor clamp behavior so timeout and penalty paths lower reward deterministically. | Zero feasibility or timeout lowers reward as expected | Yes - verified with `python -m pytest tests/test_reward.py` (26 tests pass) |
+ | API 03 | E07 | Person C | Add `POST /step` endpoint | `server/app.py`, `tests/test_server.py` | 2026-03-07 | Fixed `_build_episode_log()` to take the real `StepResult` instead of rebuilding reward data from state with stale stub values. Both REST `/step` and WebSocket step handler now pass the terminal `StepResult` to the updated helper so replay logs use real `reward_breakdown`, `judge_notes`, and `verdict` (including `timeout` vs `no_agreement`). Added five endpoint tests covering reset-then-step happy path, invalid session ID 404, terminal step with real reward breakdown, semantic invalid action returning 200 with `info.error`, and replay with real judge data. | Step endpoint accepts valid action and returns step result | Yes - verified with `python -m pytest tests/test_server.py` (10 tests pass) and `python -m pytest` (183 tests pass) |
+ | API 06 | E07 | Person C | Add WebSocket session handler with isolated env per connection | `server/app.py`, `tests/test_server.py` | 2026-03-07 | WebSocket handler at `/ws` supports `reset`, `step`, and `ping` message types with per-connection env isolation, idle timeout, and replay storage on terminal episodes. Twelve WebSocket tests cover ping-pong, reset observation, step result, full episode real reward, invalid JSON, missing action field, invalid action payload, unknown message type, session isolation, semantic invalid action returning `step_ok` with `info.error`, timeout verdict proving real-env integration, and terminal episode replay persistence via `GET /replay/{episode_id}`. | WebSocket session handler supports reset, step, ping with isolated env per connection and correct replay storage | Yes - verified with `python -m pytest tests/test_server.py` (22 tests pass) and `python -m pytest` (195 tests pass) |
113
+ | TST 07 | E11 | Person C | Add WebSocket session handler tests | `tests/test_server.py` | 2026-03-07 | Twelve focused WebSocket tests covering connectivity, message handling, error paths, session isolation, semantic-vs-transport error distinction, timeout verdict, and replay log persistence with real judge data. Tests verify that structurally valid but semantically invalid actions return `step_ok` with `info.error` (not WS error frames), matching the env contract. | WebSocket tests cover happy path, error handling, session isolation, and real-env integration | Yes - verified with `python -m pytest tests/test_server.py` (22 tests pass) |
114
+ | API 02 | E07 | Person C | Add `POST /reset` endpoint | `server/app.py`, `tests/test_server.py` | 2026-03-08 | `/reset` endpoint creates a new env (or closes the prior one when reusing `session_id`), calls `env.reset(...)`, persists env, `last_active`, and `episode_id` in the in-memory REST session store, and returns `session_id`, `episode_id`, `observation`. Seven dedicated tests cover response shape, both-role observation, explicit session_id reuse, prior-env close on reuse, default params, all scenario/difficulty combos, and seed determinism. | Reset endpoint starts a new episode and returns initial observation | Yes - verified with `python -m pytest tests/test_server.py` (29 tests pass) and `python -m pytest` (202 tests pass) |
115
+ | API 04 | E07 | Person C | Add `GET /scenarios` endpoint | `server/app.py`, `tests/test_server.py` | 2026-03-08 | `GET /scenarios` returns the `available_scenario_families()` output through the typed `ScenariosResponse` model. Five focused tests cover status code, response shape, all three scenario families, the expected `easy`, `medium`, and `hard` difficulties, and the absence of extra keys. | Endpoint lists available scenario families and difficulties | Yes - verified with `python -m pytest tests/test_server.py -v` (34 tests pass) |
116
+ | API 07 | E07 | Person C | Add idle timeout and graceful disconnect cleanup | `server/app.py`, `tests/test_server.py` | 2026-03-08 | Verified the existing WebSocket idle-timeout and disconnect cleanup path with two focused tests: one monkeypatches the idle timeout to 0.5s and confirms the server closes with code 1000 when no message arrives, and one wraps `_make_env()` to confirm `env.close()` is called exactly once from the `finally` block on disconnect. | Stale connections close cleanly and the environment closes without leak | Yes - verified with `python -m pytest tests/test_server.py -v` (34 tests pass) |
117
+ | API 13 | E07 | Person C | Add CORS middleware configuration for frontend origins in dev and production | `server/app.py`, `tests/test_server.py` | 2026-03-08 | Confirmed the existing FastAPI CORS middleware allows the local Vite frontend origin plus `https://*.hf.space`, and added three explicit preflight tests covering localhost allowance, HF Space allowance, and disallowed-origin rejection. | Frontend on localhost:5173 and HF Space origin can reach the API without CORS errors | Yes - verified with `python -m pytest tests/test_server.py -v` (34 tests pass) |
118
+ | API 08 | E07 | Person C | Build Dockerfile with Python app startup on port 7860 | `server/Dockerfile`, `Dockerfile`, `server/requirements.txt`, `docs/max/deployment.md` | 2026-03-08 | Fixed editable install (`-e .` → `. --no-deps`) in both `server/Dockerfile` and root `Dockerfile`, added `httpx` and `websocket-client` to `server/requirements.txt` (required by `replicalab.client`), rebuilt without cache. Verified Docker container starts with the **real env** (`"env":"real"`), and all four endpoints work: `GET /health`, `GET /scenarios`, `POST /reset`, `POST /step`. Added verified endpoint commands to `docs/max/deployment.md`. | Local Docker run serves app on port 7860 | Yes - verified with `docker build -f server/Dockerfile -t replicalab . && docker run -p 7860:7860 replicalab` and curl against all four endpoints |
119
+ | API 09 | E07 | Person C | Add Hugging Face Space metadata and deploy instructions | `README.md`, `Dockerfile`, `docs/max/deployment.md` | 2026-03-08 | Added the Hugging Face Spaces YAML frontmatter to the root README, created the root-level `Dockerfile` required by the Docker SDK, and documented Space creation, git remote setup, push, logs, and secret management in `docs/max/deployment.md`. | Space config is valid for Docker app deployment | Yes - verified against HF Spaces Docker deployment requirements |
120
+ | API 15 | E07 | Person C | Create HF Space README.md with YAML frontmatter | `README.md` | 2026-03-08 | Added the required Spaces frontmatter fields (`sdk: docker`, `app_port: 7860`, title, emoji, colors, pinned) to the root README so Hugging Face parses the Space metadata correctly on push. | HF Space config is valid and Space launches correctly from the metadata | Yes - verified against the HF Spaces frontmatter schema |
121
 
122
  ### Person B (Ayush) - Completed own tasks
123
 
 
127
  | SCN 11 | E03 | Create hand checked golden scenarios for prompt testing | `tests/fixtures/golden_scenarios.json`, `tests/test_scenarios.py` | 2026-03-08 | Added three deterministic golden scenarios for math, ML, and finance prompt checks plus fixture-validation tests. | Three fixed scenarios are available for deterministic manual testing | Yes - verified with `python -m pytest tests/test_scenarios.py` |
128
  | AGT 01 | E04 | Draft domain-neutral system prompt for Scientist role from normalized scenario data | `replicalab/agents/scientist_policy.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `build_scientist_system_prompt(...)` to render role guidance, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON contract from normalized scenario data. | Prompt clearly explains role, mapped constraints, and JSON output contract | Yes - verified with `python -m pytest tests/test_scientist_policy.py` and a direct prompt-build smoke check |
129
  | AGT 02 | E04 | Build observation to prompt formatting helper from normalized scenario-derived observations | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `format_scientist_observation(...)` to render round status, paper context, conversation history, current protocol, and the next-action instruction in a fixed deterministic order, and exported it through the agent package. | Formatted prompt includes task info, history, and action schema consistently | Yes - verified with `python -m pytest tests/test_scientist_policy.py` |
130
+ | AGT 04 | E04 | Build baseline heuristic Scientist for non trained smoke tests | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `build_baseline_scientist_action(...)`, a deterministic baseline Scientist policy that proposes a protocol on the first turn, revises only when the latest Lab Manager feedback contains an obvious blocker, and otherwise accepts the current protocol so smoke episodes can finish cleanly. | Baseline can complete episodes without crashing | Yes - verified with `python -m pytest tests/test_scientist_policy.py` including a stub-env episode smoke test |
131
  | AGT 05 | E04 | Implement deterministic feasibility checker over normalized constraints and resources | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added a deterministic Lab Manager feasibility checker with a typed `FeasibilityCheckResult`, explicit per-dimension protocol, budget, equipment, reagents, schedule, staff, and policy checks, substitution reporting, and stable summary output. | Checker returns clear pass or fail per constraint dimension | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py tests/test_validation.py tests/test_scientist_policy.py` |
132
  | AGT 06 | E04 | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added deterministic alternative-suggestion logic that applies substitutions, duration clamps, and sample-size reductions in fixed order, re-runs feasibility after the revision, and returns a typed `AlternativeSuggestion` with applied changes, remaining failures, and pre or post feasibility checks. | Lab Manager can suggest at least one sensible revision when the initial plan fails | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py` |
133
  | AGT 07 | E04 | Add grounded Lab Manager response synthesis from feasibility results and suggested revisions | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `server/app.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added `compose_lab_manager_response(...)`, a deterministic outward-action composer that converts feasibility plus alternative-suggestion results into a typed `LabManagerAction` with stable flags, readable explanations, and optional injected explanation rendering, then wired the stub server to log those grounded responses instead of placeholder text. | Output is readable, grounded in checker results, and maps cleanly to underlying checks | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py` and a stub-env step smoke check |
134
 | AGT 11 | E04 | Select and document base model for Scientist training | `docs/agt11_scientist_model_selection.md`, `README.md` | 2026-03-08 | Recorded `Qwen3-4B` as the primary Scientist training model with `Qwen3-8B` as the H100-only stretch fallback, and surfaced the decision in the README so the training path uses one canonical model choice. | Decision is recorded and all team members know which model will be fine-tuned | Yes - verified by the decision record and README update |
135
+ | AGT 03 | E04 | Add parse plus retry strategy for malformed model output | `replicalab/agents/scientist_policy.py`, `tests/test_scientist_policy.py` | 2026-03-07 | Added `call_scientist_with_retry(...)` with error-specific correction prompts, bounded retry loop, and exposed `RetryMetadata` telemetry. Seven focused tests cover first-try success, malformed-then-valid, invalid-then-valid, exhaustion, correction message content, and metadata serialization. | Malformed output triggers at least one controlled retry or explicit failure | Yes - verified with `python -m pytest tests/test_scientist_policy.py` (7 retry tests pass) |
136
+ | AGT 08 | E04 | Add prompt formatting, parse, and bounded-tool policy tests for Scientist policy | `replicalab/agents/scientist_policy.py`, `tests/test_scientist_policy.py` | 2026-03-07 | Added bounded-tool policy block to `build_scientist_system_prompt(...)` naming `search_evidence`, `run_code_check`, and `inspect_image` with explicit rules. Added 24 new tests covering parser happy paths (propose, accept, prose-wrapped), parser edge cases (empty, whitespace, list, extra keys, `to_dict()`), system prompt across all 3 domains plus dict coercion, bounded-tool policy assertions across all domains, role-boundary and output-contract assertions, formatter edge cases (final round, empty-list protocol), and baseline domain inference and forced-accept behavior. | Tests cover happy path, malformed output handling, and stable tool-policy reminders | Yes - verified with `python -m pytest tests/test_scientist_policy.py` (46 tests pass) and `python -m pytest tests/` (111 tests pass) |
137
+ | TRN 13 | E08 | Create reusable environment client module | `replicalab/client.py`, `tests/test_client.py` | 2026-03-08 | Added `ReplicaLabClient` with dual transport support (REST via `httpx`, WebSocket via `websocket-client`), unified sync interface (`connect`, `reset`, `step`, `state`, `close`), context manager support, internal session ID tracking, typed returns mapped to Pydantic models, and constructor-level transport selection. Twenty-four tests cover both transports: connect, reset, step, full episode, replay, context manager, error paths, semantic invalid action handling, and constructor validation. | Client module can be imported by notebook and other consumers without duplicating connection logic | Yes - verified with `python -m pytest tests/test_client.py` (24 tests pass) and `python -m pytest` (231 tests pass) |
138
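The `ReplicaLabClient` row above (TRN 13) implies the rollout loop that TRN 03 will build on: reset, act until `done`, close. A runnable sketch of that loop, with a stub client standing in for the real class (the two-step terminal logic and the 5.0 reward are placeholders, not the real env's behavior):

```python
class DummyClient:
    """Stand-in with the reset/step/close surface described in TRN 13."""
    def __init__(self) -> None:
        self._turns = 0

    def reset(self, seed=None):
        self._turns = 0
        return {"round": 0}

    def step(self, action):
        self._turns += 1
        done = self._turns >= 2          # terminal after two steps in this stub
        reward = 5.0 if done else 0.0
        return {"round": self._turns}, reward, done

    def close(self) -> None:
        pass

def run_episode(client, policy, seed=None) -> float:
    """Drive one episode and return the total reward."""
    obs = client.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, done = client.step(policy(obs))
        total += reward
    client.close()
    return total
```

With the stub, `run_episode(DummyClient(), lambda obs: {"action_type": "accept"})` finishes in two steps and returns 5.0.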
 
139
  ### Kush (Person D) - Completed on behalf of others
140
 
 
205
  | AGT 06 | No new formal dependency edge by itself, but `AGT 07` now has deterministic revision content to narrate and compare against |
206
  | AGT 07 | `AGT 10` now only waits on `JDG 06`, and the stub server now emits grounded Lab Manager responses instead of placeholder review text |
207
  | AGT 11 | No new formal dependency edge by itself, but the Scientist training model choice is now fixed across repo docs |
208
+ | ENV 01 | ENV 02, ENV 08, and the real-environment import path that partial server tasks now depend on |
209
+ | JDG 01 | Together with JDG 02 and JDG 03, unblocks JDG 04 (total reward formula) |
210
+ | JDG 02 | Together with JDG 01 and JDG 03, unblocks JDG 04 (total reward formula) |
211
+ | JDG 03 | Together with JDG 01 and JDG 02, unblocks JDG 04 (total reward formula) |
212
+ | JDG 04 | JDG 05, JDG 08, TST 04, TST 05 |
213
+ | JDG 05 | JDG 06, JDG 07, JDG 09, JDG 10, JDG 11, ENV 06 |
214
+ | ENV 02 | ENV 03, ENV 07, ENV 10, TST 01, API 02 (partial → full) |
215
+ | ENV 03 | ENV 04, ENV 05, TST 02, TST 03 |
216
+ | ENV 04 | ENV 05, TST 02 |
217
+ | ENV 05 | ENV 06, TST 02 |
218
+ | ENV 06 | ENV 07, ENV 09, ENV 11, API 03 (partial → full), API 06 (partial → full), OBS 07 |
219
+ | API 06 | TRN 03, TRN 13 |
220
+ | API 09 | API 10, API 17 |
221
+ | TST 07 | No new dependencies |
222
+ | ENV 07 | ENV 10 (partial unblock) |
223
+ | ENV 08 | API 07 (partial → full) |
224
+ | TST 01 | No new dependencies |
225
+ | TST 02 | No new dependencies |
226
+ | TST 03 | No new dependencies |
227
+ | API 02 | API 14 (partial → closer to full), UI 06 |
228
+ | TRN 13 | TRN 03 now has both its dependencies met (API 06 + TRN 13) |
229
+ | API 08 | API 09, API 16, API 19 |
230
 
231
  ### Current Unblocked and Active Tasks
232
 
 
234
  |----|-------|------|-------------|
235
  | FND 13 | Kush (Person D) | Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles | FND 03 |
236
  | UI 01 | Kush (Person D) | Create application shell with three panel layout | FND 03 |
237
+ | AGT 09 | Kian (Person A) | Add deterministic feasibility checker tests for Lab Manager grounding | AGT 05 to AGT 07 |
238
  | MOD 06 | Kian (Person A) | Add semantic validators for impossible plans such as zero sample size with positive controls | MOD 05 |
239
  | MOD 07 | Max (Person C) | Add state serialization helper for replay logs | MOD 04 |
 
 
 
240
  | SCN 13 | Kian (Person A) | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time-slot conflicts and duration | SCN 07 |
 
241
  | DOC 01 | Kush (Person D) | Write hook, problem statement, and one line product summary | FND 06 |
242
+ | JDG 06 | Kian (Person A) | Add optional plain English explanation function from reward breakdown | JDG 05 |
243
+ | JDG 08 | Kian (Person A) | Add score determinism tests and edge case tests | JDG 01 to JDG 05 |
244
+ | ENV 10 | Kian (Person A) | Add reset, step, invalid action, timeout, and deterministic replay tests | ENV 02 to ENV 09 |
245
+ | JDG 09 | Kush (Person D) | Create mock score cards and language for frontend | JDG 05 |
246
+ | API 10 | Max (Person C) | Deploy live Space and verify health, reset, and step | API 09 |
247
+ | API 17 | Max (Person C) | Document secrets and API key management for hosted deployment and Colab | API 09 |
248
+ | TRN 03 | Person B (Ayush) | Implement env client wrapper for training rollouts | API 06, TRN 13 |
249
+
250
+ Note: Person B (Ayush) has `TRN 03` as the next unblocked task (`TRN 13` is now complete). `AGT 10` still waits on `JDG 06` (Person A). The remaining TRN chain waits on `API 10` (Person C) and judge tasks (Person A).
251
 
252
  ---
253
 
 
258
  | E01. Foundations and repository setup | 13 | 12 | 92.31% |
259
  | E02. Domain models, validation, state contracts | 12 | 8 | 66.67% |
260
  | E03. Scenario engine and constraint generation | 13 | 11 | 84.62% |
261
+ | E04. Scientist agent and Lab Manager policy | 11 | 9 | 81.82% |
262
+ | E05. Judge engine and reward logic | 11 | 5 | 45.45% |
263
+ | E06. OpenEnv environment implementation | 11 | 8 | 72.73% |
264
+ | E07. API, server, Docker, deployment | 19 | 9 | 47.37% |
265
+ | E08. RL training pipeline and evaluation | 15 | 1 | 6.67% |
266
  | E09. Frontend, UX, replay, demo views | 15 | 0 | 0% |
267
  | E10. Logging, replay, and observability | 9 | 0 | 0% |
268
+ | E11. Testing and quality gates | 12 | 6 | 50.00% |
269
  | E12. README, demo video, submission packaging | 11 | 0 | 0% |
docs/fnd08_frozen_json_contract.md CHANGED
@@ -16,6 +16,17 @@ This document freezes the JSON contract for the shared ReplicaLab data models so
16
  - Person C API payload examples
17
  - Person D frontend and replay mocks
18
 
 
 
 
 
 
 
 
 
 
 
 
19
  ## Global conventions
20
 
21
  - All JSON keys use `snake_case`.
 
16
  - Person C API payload examples
17
  - Person D frontend and replay mocks
18
 
19
+ ## Tool-Capability Addendum
20
+
21
+ The richer-capability MVP adds bounded search, code-check, and image-inspection
22
+ support below this frozen contract.
23
+
24
+ This addendum does **not** reopen the outward action schema from `FND 08`.
25
+ The final outward actions remain `ScientistAction` and `LabManagerAction`.
26
+ Bounded tool use will be represented through scenario or evidence metadata,
27
+ environment-side tool traces, and `StepResult.info` or replay payloads rather
28
+ than new outward action types for the MVP.
29
+
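As one illustration of the addendum, a bounded tool trace could ride along in `StepResult.info` without touching the outward action schema. Every key name below is an assumption for illustration; nothing here is frozen by `FND 08`:

```python
# Hypothetical trace shape carried outside the outward action types.
tool_trace = {
    "tool_calls": [
        {"tool": "search_evidence", "source": "frozen_evidence_pack", "hits": 2},
        {"tool": "run_code_check", "passed": True},
    ]
}
step_info = {"tool_trace": tool_trace}  # surfaced via StepResult.info or replay payloads
```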
30
  ## Global conventions
31
 
32
  - All JSON keys use `snake_case`.
docs/kian/task_breakdown.md CHANGED
@@ -6,28 +6,39 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
6
 
7
  ## Current status
8
 
9
- - `FND 04`, `FND 08`, `FND 09`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`, `MOD 11`, and `MOD 12` are complete
10
- - Shared `AGT 05` is now complete, so the deterministic feasibility layer exists for both the Lab Manager path and the later judge feasibility score
11
- - `SCN 01` to `SCN 10` are also complete, so the deterministic scenario layer now exists in code
12
- - The Kian lane no longer needs to start with scenario seeding or template scaffolding
13
- - The remaining high-leverage work is semantic edge-case validation, booking conflicts, judge logic, and the real environment
 
 
 
 
 
 
 
 
 
 
 
14
 
15
  ---
16
 
17
  ## Recommended execution order
18
 
19
- 1. `MOD 06` -- extend the new semantic validation layer to catch impossible edge cases early
20
- 2. `SCN 13` -- deepen the normalized scenario layer with booking and scheduling conflicts
21
- 3. `JDG 01`, `JDG 02`, and `JDG 03` -- start the deterministic reward components that are now unblocked
22
- 4. `JDG 04` and `JDG 05` -- complete the reward pipeline once the component scorers exist
23
- 5. `ENV 01` and `ENV 02` -- once typed state and core scoring pieces are in place, start the real OpenEnv environment path
24
 
25
  ---
26
 
27
  ## Why this order
28
 
29
  - `MOD 06` is the smallest remaining contract-hardening task and builds directly on the completed `MOD 05` validator.
30
- - `SCN 13` is the remaining scenario-layer depth task; it builds naturally on the completed normalized resource model.
31
- - `JDG 01` and `JDG 03` can start immediately because their only formal prerequisite, `SCN 08`, is already complete.
32
- - `JDG 02` is now also unblocked because the deterministic feasibility checker from `AGT 05` exists.
33
- - The environment path can now start from typed state and step-result contracts instead of loose dict-based placeholders.
 
6
 
7
  ## Current status
8
 
9
+ - `FND 04`, `FND 08`, `FND 09`, `MOD 01` to `MOD 05`, `MOD 11`, `MOD 12` are complete
10
+ - Shared `AGT 05` is now complete, so the deterministic feasibility layer exists for both the Lab Manager path and the judge feasibility score
11
+ - `SCN 01` to `SCN 10` are complete, so the deterministic scenario layer exists in code
12
+ - `ENV 01` to `ENV 08` are all complete: the full environment lifecycle (reset, step, validate, Lab Manager response, termination, judge scoring, state snapshot, close) works end-to-end
13
+ - `JDG 01` to `JDG 05` are complete — the full deterministic reward pipeline (rigor, feasibility, fidelity, total reward formula with floor clamp, breakdown builder with named penalty extension point) is wired and tested
14
+ - `TST 01` to `TST 05` are complete with 36 env tests and 26 reward tests passing
15
+ - The remaining high-leverage work is semantic edge-case validation, booking conflicts, judge explanation output, and environment test suite expansion
16
+
17
+ Bounded-tool scope note:
18
+
19
+ 1. Kian-owned scenario, judge, and environment tasks now need to support
20
+ bounded `search`, `code_check`, and `image_inspection` traces without
21
+ changing the outer action contract.
22
+ 2. Training reward must remain deterministic and must not depend on live web access.
23
+ 3. Frozen evidence packs are the default training-time source of tool inputs.
24
+ 4. Audio remains out of scope.
25
 
26
  ---
27
 
28
  ## Recommended execution order
29
 
30
+ 1. `MOD 06` -- extend the semantic validation layer to catch impossible edge cases early
31
+ 2. `SCN 13` -- deepen the normalized scenario layer with booking, scheduling, and evidence-pack support
32
+ 3. `JDG 06` -- add plain English explanation function from reward breakdown (unblocks AGT 10 for Ayush)
33
+ 4. `JDG 08` -- add score determinism tests and edge case tests
34
+ 5. `ENV 10` -- add comprehensive env tests (reset, step, invalid action, timeout, deterministic replay)
35
 
36
  ---
37
 
38
  ## Why this order
39
 
40
  - `MOD 06` is the smallest remaining contract-hardening task and builds directly on the completed `MOD 05` validator.
41
+ - `SCN 13` is the remaining scenario-layer depth task; it now also needs to carry booking-conflict and evidence-pack data in a deterministic way.
42
+ - `JDG 06` is the highest-leverage remaining judge task because it directly unblocks `AGT 10` (Ayush's prompt text files) and `JDG 11` (structured audit payload).
43
+ - `JDG 08` builds on the now-complete JDG 01-05 pipeline to add regression coverage for score ordering and edge cases.
44
+ - `ENV 10` builds on the complete ENV 01-08 lifecycle to add comprehensive environment test coverage.
docs/kian/task_list.md CHANGED
@@ -11,8 +11,10 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
11
  - Shared `AGT 05` is now complete through Ayush's implementation of the deterministic feasibility checker
12
  - `SCN 01` to `SCN 10` are now complete in the repo
13
  - The normalized scenario pack, seeded generation, difficulty scaling, and three initial domain families are already present
14
- - The next Kian-lane tasks are now `MOD 06`, `SCN 13`, `JDG 01`, `JDG 02`, `JDG 03`, and `ENV 01`
15
- - `MOD 05` and shared `AGT 05` now exist, so the judge and environment path can build on real scenario-grounded checks instead of placeholder rules
 
 
16
 
17
  ---
18
 
@@ -20,10 +22,9 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
20
 
21
  - [ ] **MOD 06** | Add semantic validators for impossible plans such as zero sample size with positive controls | 0.75h | Depends: MOD 05
22
  - [ ] **SCN 13** | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time-slot conflicts and duration | 1h | Depends: SCN 07
23
- - [ ] **JDG 01** | Implement rigor or objective-validity score for plan completeness, required checks, method quality, and justification | 1.25h | Depends: SCN 08
24
- - [ ] **JDG 02** | Implement feasibility score for budget, resources, time, staffing, compute, and bookings | 1.25h | Depends: SCN 07, AGT 05
25
- - [ ] **JDG 03** | Implement fidelity score against hidden reference spec, required steps, and allowed substitutions | 1h | Depends: SCN 08
26
- - [ ] **ENV 01** | Create `ReplicaLabEnv` class skeleton | 0.5h | Depends: MOD 04, SCN 09
27
 
28
  ---
29
 
@@ -50,3 +51,21 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
50
  - [x] **SCN 08** | Completed by Person B (Ayush)
51
  - [x] **SCN 09** | Completed by Person B (Ayush)
52
  - [x] **SCN 10** | Completed by Person B (Ayush)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
11
  - Shared `AGT 05` is now complete through Ayush's implementation of the deterministic feasibility checker
12
  - `SCN 01` to `SCN 10` are now complete in the repo
13
  - The normalized scenario pack, seeded generation, difficulty scaling, and three initial domain families are already present
14
+ - `ENV 01` to `ENV 08` are now complete, so the full environment lifecycle (reset, step, validate, Lab Manager response, termination, judge scoring, state snapshot, close) works end-to-end
15
+ - `JDG 01` to `JDG 05` are now complete, so the deterministic reward pipeline (rigor, feasibility, fidelity, total reward formula, breakdown builder) is fully wired
16
+ - `TST 01` to `TST 05` are now complete, with 36 env tests and 26 reward tests passing
17
+ - The next Kian-lane tasks are `MOD 06`, `SCN 13`, `JDG 06`, `JDG 08`, `ENV 10`
18
 
19
  ---
20
 
 
22
 
23
  - [ ] **MOD 06** | Add semantic validators for impossible plans such as zero sample size with positive controls | 0.75h | Depends: MOD 05
24
  - [ ] **SCN 13** | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time-slot conflicts and duration | 1h | Depends: SCN 07
25
+ - [ ] **JDG 06** | Add optional plain English explanation function from reward breakdown | 0.75h | Depends: JDG 05
26
+ - [ ] **JDG 08** | Add score determinism tests and edge case tests | 1h | Depends: JDG 01 to JDG 05
27
+ - [ ] **ENV 10** | Add reset, step, invalid action, timeout, and deterministic replay tests | 1.25h | Depends: ENV 02 to ENV 09
 
28
 
29
  ---
30
 
 
51
  - [x] **SCN 08** | Completed by Person B (Ayush)
52
  - [x] **SCN 09** | Completed by Person B (Ayush)
53
  - [x] **SCN 10** | Completed by Person B (Ayush)
54
+ - [x] **ENV 01** | Completed by Person B (Ayush)
55
+ - [x] **ENV 02** | Completed by Person B (Ayush)
56
+ - [x] **ENV 03** | Completed by Person B (Ayush)
57
+ - [x] **ENV 04** | Completed by Person B (Ayush)
58
+ - [x] **ENV 05** | Completed by Person B (Ayush)
59
+ - [x] **ENV 06** | Completed by Person B (Ayush)
60
+ - [x] **ENV 07** | Completed by Person B (Ayush)
61
+ - [x] **ENV 08** | Completed by Person B (Ayush)
62
+ - [x] **JDG 01** | Completed by Person B (Ayush)
63
+ - [x] **JDG 02** | Completed by Person B (Ayush)
64
+ - [x] **JDG 03** | Completed by Person B (Ayush)
65
+ - [x] **JDG 04** | Completed by Person B (Ayush)
66
+ - [x] **JDG 05** | Completed by Person B (Ayush)
67
+ - [x] **TST 01** | Completed by Person B (Ayush)
68
+ - [x] **TST 02** | Completed by Person B (Ayush)
69
+ - [x] **TST 03** | Completed by Person B (Ayush)
70
+ - [x] **TST 04** | Completed by Person B (Ayush)
71
+ - [x] **TST 05** | Completed by Person B (Ayush)
docs/map/scoring.md CHANGED
@@ -1,19 +1,21 @@
1
  # Scoring Map — `replicalab/scoring/`
2
 
3
  > Judge scoring engine for protocol evaluation.
4
- > Pure deterministic functions — no LLM calls, no side effects.
5
  >
6
- > **Tasks implemented:** JDG 01, JDG 02, JDG 03
7
- > **Tasks remaining:** JDG 04-08
8
 
9
  ## Architecture
10
 
11
  ```
12
  replicalab/scoring/
13
- __init__.py # exports: score_rigor, score_feasibility, score_fidelity
 
14
  rigor.py # JDG 01 — protocol structural quality
15
  feasibility.py # JDG 02 — resource feasibility (wraps AGT 05)
16
  fidelity.py # JDG 03 — adherence to hidden reference spec
 
17
  ```
18
 
19
  ## Shared Utilities
@@ -142,15 +144,40 @@ This is the key difference from JDG 01's element check.
142
 
143
  ---
144
 
145
- ## Not Yet Implemented
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
146
 
147
- ### `compute_reward(protocol, scenario, ...) -> RewardBreakdown` — JDG 04/05
148
- Combines rigor + feasibility + fidelity with weights.
149
- Applies efficiency bonus (rounds used), communication bonus, and penalties.
 
 
 
 
 
 
 
 
150
 
151
  ### Bonuses & Penalties — JDG 06-08
152
- - `efficiency_bonus`: reward for finishing in fewer rounds
153
- - `communication_bonus`: reward for clear negotiation
154
  - `penalties`: policy violations, hallucinated resources, etc.
155
 
156
  ## Data Consumed
 
1
  # Scoring Map — `replicalab/scoring/`
2
 
3
  > Judge scoring engine for protocol evaluation.
4
+ > Pure deterministic functions — no model calls, no side effects.
5
  >
6
+ > **Tasks implemented:** JDG 01, JDG 02, JDG 03, JDG 04, JDG 05
7
+ > **Tasks remaining:** JDG 06-08
8
 
9
  ## Architecture
10
 
11
  ```
12
  replicalab/scoring/
13
+ __init__.py # exports: score_rigor, score_feasibility, score_fidelity,
14
+ # build_reward_breakdown, compute_total_reward
15
  rigor.py # JDG 01 — protocol structural quality
16
  feasibility.py # JDG 02 — resource feasibility (wraps AGT 05)
17
  fidelity.py # JDG 03 — adherence to hidden reference spec
18
+ rubric.py # JDG 04-05 — total reward formula and breakdown builder
19
  ```
20
 
21
  ## Shared Utilities
 
144
 
145
  ---
146
 
147
+ ---
148
+
149
+ ## JDG 04 — `compute_total_reward(breakdown) -> float`
150
+
151
+ **File:** `rubric.py`
152
+ **Formula:** `10 × rigor × feasibility × fidelity + efficiency_bonus + communication_bonus − sum(penalties)`
153
+
154
+ Returns a scalar reward from a `RewardBreakdown` object.
155
+
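A minimal sketch of the formula above, with a plain dict standing in for `RewardBreakdown` and the zero-floor clamp covered by TST 05 applied at the end (the dict field names and the name-to-value `penalties` mapping are assumptions matching this map, not the exact Pydantic model):

```python
def compute_total_reward(b: dict) -> float:
    """Sketch of the JDG 04 formula; penalties is a name -> value mapping."""
    raw = (10.0 * b["rigor"] * b["feasibility"] * b["fidelity"]
           + b["efficiency_bonus"]
           + b["communication_bonus"]
           - sum(b["penalties"].values()))
    return max(0.0, raw)  # zero-floor clamp: reward never goes negative

breakdown = {"rigor": 1.0, "feasibility": 0.5, "fidelity": 1.0,
             "efficiency_bonus": 0.5, "communication_bonus": 0.0,
             "penalties": {}}
# 10 * 1.0 * 0.5 * 1.0 + 0.5 = 5.5
```

The multiplicative core means any zero dimension collapses the base reward, leaving only bonuses minus penalties before the clamp.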
156
+ ## JDG 05 — `build_reward_breakdown(protocol, scenario, rounds_used, max_rounds, *, check=None) -> RewardBreakdown`
157
+
158
+ **File:** `rubric.py`
159
+ **Composes:** rigor (JDG 01) + feasibility (JDG 02) + fidelity (JDG 03) + efficiency bonus.
160
+
161
+ ### Efficiency Bonus
162
+ - Max bonus: 1.0 (configurable via `_MAX_EFFICIENCY_BONUS`)
163
+ - Formula: `max_bonus × (max_rounds - rounds_used) / (max_rounds - 1)`
164
+ - Finishing in round 1 of 6 → maximum bonus; using all rounds → 0
165
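The bullet formula is a simple linear ramp; a runnable sketch follows. The real helper is the internal `_efficiency_bonus(rounds_used, max_rounds)`; the explicit `max_bonus` parameter and the `max_rounds <= 1` division guard here are assumptions:

```python
def efficiency_bonus(rounds_used: int, max_rounds: int, max_bonus: float = 1.0) -> float:
    """Linear bonus: full at round 1, zero when every round is used."""
    if max_rounds <= 1:
        return 0.0  # guard against division by zero (assumed behavior)
    return max_bonus * (max_rounds - rounds_used) / (max_rounds - 1)
```

Finishing round 1 of 6 yields 1.0; using all 6 rounds yields 0.0.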
 
166
+ ### Internal Functions
167
+
168
+ | Function | Purpose |
169
+ |----------|---------|
170
+ | `compute_total_reward(breakdown)` | Apply the reward formula |
171
+ | `build_reward_breakdown(...)` | Compose all sub-scores into a breakdown |
172
+ | `_efficiency_bonus(rounds_used, max_rounds)` | Compute efficiency bonus |
173
+
174
+ ---
175
+
176
+ ## Not Yet Implemented
177
 
178
  ### Bonuses & Penalties — JDG 06-08
179
+ - `explanation_function`: optional plain English from reward breakdown (JDG 06)
180
+ - `communication_bonus`: reward for clear negotiation (reserved)
181
  - `penalties`: policy violations, hallucinated resources, etc.
182
 
183
  ## Data Consumed
docs/map/server.md CHANGED
@@ -1,128 +1,80 @@
1
# Server Map — `server/app.py`
2
 
3
- > FastAPI backend with REST + WebSocket endpoints and stub environment.
 
 
4
  >
5
- > **Tasks implemented:** API 01-04, 06 (partial)
6
 
7
- ## Environment
 
 
 
 
8
 
9
  ### `_StubEnv`
10
- Minimal environment stub used until the real `ReplicaLabEnv` is implemented (ENV 01-11).
11
-
12
- **State:**
13
- | Attribute | Type | Purpose |
14
- |-----------|------|---------|
15
- | `_state` | `EpisodeState` | Full episode state |
16
- | `_episode_id` | `str` | UUID for this episode |
17
- | `_scenario_pack` | `NormalizedScenarioPack \| None` | Stored for lab manager pipeline |
18
- | `_logs` | `list[ConversationEntry]` | Conversation transcript |
19
-
20
- **Methods:**
21
-
22
- | Method | Returns | Behavior |
23
- |--------|---------|----------|
24
- | `reset(seed, scenario, difficulty)` | `Observation` | Generates scenario, builds initial observations |
25
- | `step(action: ScientistAction)` | `StepResult` | Processes scientist action, runs lab manager pipeline |
26
- | `state()` | `EpisodeState` | Returns current state snapshot |
27
- | `episode_id()` | `str` | Returns episode UUID |
28
- | `close()` | `None` | No-op |
29
-
30
- **Lab Manager Integration (AGT 07):**
31
- The `_lab_manager_action()` method runs the full deterministic pipeline:
32
- 1. `check_feasibility(protocol, scenario_pack)` → `FeasibilityCheckResult`
33
- 2. `suggest_alternative(protocol, check_result, scenario_pack)` → `AlternativeSuggestion | None`
34
- 3. `compose_lab_manager_response(check_result, suggestion)` → `LabManagerAction`
35
-
36
- **Termination logic:**
37
- - Episode ends (`done=True`) when `agreement_reached=True` (both agents accept)
38
- - `agreement_reached` when lab manager action_type is `accept` (2-round stub logic)
39
- - On termination: reward = `STUB_ACCEPT_REWARD` (5.0)
40
-
41
- ### `_make_env() -> _StubEnv`
42
- Factory that tries to import `ReplicaLabEnv` from `replicalab.env`, falls back to `_StubEnv`.
43
-
44
- ## REST Endpoints
45
 
46
  ### `GET /health`
47
- Returns `{"status": "ok"}`.
 
48
 
49
  ### `POST /reset`
50
- **Request:** `ResetRequest`
51
- | Field | Type | Default |
52
- |-------|------|---------|
53
- | `seed` | `int \| None` | `None` (random) |
54
- | `scenario` | `str` | `DEFAULT_SCENARIO_TEMPLATE` |
55
- | `difficulty` | `str` | `DEFAULT_DIFFICULTY` |
56
- | `session_id` | `str \| None` | `None` (auto-generated) |
57
-
58
- **Response:** `ResetResponse`
59
- | Field | Type |
60
- |-------|------|
61
- | `session_id` | `str` |
62
- | `episode_id` | `str` |
63
- | `observation` | `Observation` |
64
 
65
- ### `POST /step`
66
- **Request:** `StepRequest`
67
- | Field | Type |
68
- |-------|------|
69
- | `session_id` | `str` |
70
- | `action` | `ScientistAction` |
71
 
72
- **Response:** `StepResult` (observation, reward, done, info)
 
73
 
74
- When `done=True`, the episode log is stored in `_replay_store`.
 
 
75
 
76
  ### `GET /scenarios`
77
- Returns `available_scenario_families()` list of families with difficulties.
78
 
79
  ### `GET /replay/{episode_id}`
80
- Returns `EpisodeLog` for a completed episode, or 404 if not found.
81
 
82
- ## WebSocket Endpoint
83
 
84
  ### `WS /ws`
85
- Bidirectional session with JSON messages.
86
-
87
- **Client → Server messages:**
88
- | Type | Payload | Behavior |
89
- |------|---------|----------|
90
- | `reset` | `{seed, scenario, difficulty}` | Creates env, returns initial state |
91
- | `step` | `{action: ScientistAction}` | Steps env, returns result |
92
- | `ping` | — | Returns `{"type": "pong"}` |
93
 
94
- **Server → Client messages:**
95
- | Type | Payload |
96
- |------|---------|
97
- | `state` | `{observation, episode_id}` |
98
- | `step_result` | `StepResult.info.model_dump()` |
99
- | `pong` | `{}` |
100
- | `error` | `{message}` |
101
 
102
- ## Session Management
103
 
104
- | Store | Type | Purpose |
105
- |-------|------|---------|
106
- | `_sessions` | `dict[str, dict]` | Active REST sessions (env + last_active) |
107
- | `_replay_store` | `dict[str, EpisodeLog]` | Completed episode logs |
108
 
109
- **Cleanup:** Background task runs every 60s, removes sessions older than `SESSION_TTL_SECONDS` (300s).
 
 
 
110
 
111
- ## Helper Functions
112
 
113
  | Function | Purpose |
114
- |----------|---------|
115
- | `_reward_breakdown_from_state(state)` | Extract RewardBreakdown from EpisodeState scores |
116
- | `_build_episode_log(episode_id, state)` | Build EpisodeLog from final state |
117
- | `_touch(session_id)` | Update last_active timestamp |
118
- | `_cleanup_stale_sessions()` | Remove expired sessions |
119
-
120
- ## Dependencies
121
-
122
- ```python
123
- from replicalab.agents import check_feasibility, compose_lab_manager_response, suggest_alternative
124
- from replicalab.config import API_HOST, API_PORT, DEFAULT_DIFFICULTY, ...
125
- from replicalab.models import (ConversationEntry, EpisodeLog, EpisodeState, LabManagerAction,
126
- Observation, Protocol, RewardBreakdown, ScientistAction, StepInfo, StepResult, ...)
127
- from replicalab.scenarios import NormalizedScenarioPack, available_scenario_families, generate_scenario
128
- ```
 
+ # Server Map — `server/app.py`
 
+ > FastAPI backend with REST + WebSocket endpoints. The normal path now uses
+ > the real `ReplicaLabEnv`; `_StubEnv` remains only as a fallback if the env
+ > package cannot be imported.
  >
+ > **Tasks implemented:** API 01-09, 13, 15
 
+ ## Environment path
+
+ ### `ReplicaLabEnv`
+ Primary environment implementation imported from
+ `replicalab.env.replicalab_env`.
 
  ### `_StubEnv`
+ Legacy fallback kept so the server can still boot if the real env import
+ fails. It is no longer the intended local or Docker runtime.
+
+ ### `_make_env()`
+ Factory that prefers `ReplicaLabEnv` and falls back to `_StubEnv` only on
+ import failure.
+
+ ## REST endpoints
 
  ### `GET /health`
+ Returns a liveness payload. When the real env path is active, the response
+ includes `env: "real"`.
 
  ### `POST /reset`
+ Starts a new episode and returns:
 
+ - `session_id`
+ - `episode_id`
+ - a typed `Observation`
 
+ ### `POST /step`
+ Submits a typed `ScientistAction` and returns a `StepResult`.
 
+ When `done=true`, the terminal `StepResult` is also used to build the replay
+ log so `reward_breakdown`, `judge_notes`, and `verdict` stay aligned with the
+ real env result.
 
  ### `GET /scenarios`
+ Returns the available scenario families and supported difficulties.
 
  ### `GET /replay/{episode_id}`
+ Returns the stored `EpisodeLog` for a completed episode, or 404 if not found.
 
+ ## WebSocket endpoint
 
  ### `WS /ws`
+ Per-connection isolated environment session supporting:
 
+ - `reset`
+ - `step`
+ - `ping`
 
+ Idle timeout and disconnect cleanup are implemented and verified.
 
+ ## Session management
 
+ | Store | Purpose |
+ | --- | --- |
+ | `_sessions` | Active REST sessions |
+ | `_replay_store` | Completed episode logs |
 
+ ## Key helpers
 
  | Function | Purpose |
+ | --- | --- |
+ | `_build_episode_log(episode_id, state, result)` | Build replay log from final state and terminal step result |
+ | `_touch(session_id)` | Refresh REST session last-active timestamp |
+ | `_cleanup_stale_sessions()` | Remove expired REST sessions |
+
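The TTL-based cleanup behind `_touch` and `_cleanup_stale_sessions` can be sketched as below. The 300 s TTL matches the `SESSION_TTL_SECONDS` documented elsewhere in this repo; the exact shape of the session dict and the `now` parameter are assumptions for illustration.

```python
import time

SESSION_TTL_SECONDS = 300  # documented REST-session TTL

# session_id -> {"env": <env instance>, "last_active": <epoch seconds>}
# (shape assumed; mirrors the _sessions store described above)
_sessions: dict[str, dict] = {}

def _touch(session_id: str) -> None:
    # Refresh the last-active timestamp on every request for this session.
    _sessions[session_id]["last_active"] = time.time()

def _cleanup_stale_sessions(now: float | None = None) -> None:
    # Collect first, then pop, so we never mutate the dict while iterating.
    now = time.time() if now is None else now
    stale = [sid for sid, s in _sessions.items()
             if now - s["last_active"] > SESSION_TTL_SECONDS]
    for sid in stale:
        _sessions.pop(sid)  # the real server would also close the session's env here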
+ ## Current deployment state
+
+ - Local OpenEnv validation passes
+ - Local Docker build and run verification passes
+ - HF Spaces metadata is present in the root `README.md` and root `Dockerfile`
+ - Live hosted verification remains `API 10`
 
docs/map/tests.md CHANGED
@@ -1,8 +1,8 @@
  # Tests Map — `tests/`
 
- > 134 tests across 8 files. All passing.
  >
- > **Last verified:** 2026-03-07
 
  ## Summary
 
@@ -14,15 +14,17 @@
  | `test_validation.py` | 13 | Protocol validation checks |
  | `test_scientist_policy.py` | 18+ | Parser, retry, formatter, baseline, bounded tools |
  | `test_lab_manager_policy.py` | 13 | Feasibility, suggestion, response |
- | `test_reward.py` | 18 | JDG 01-03 scoring functions |
- | `test_server.py` | 5 | API endpoint integration |
- | **Total** | **134** | |
 
  ## Missing Coverage (not yet implemented)
 
  | File (planned) | Would cover |
  |---------------|-------------|
- | `test_env.py` | ENV 01-11 real environment |
 
  ---
 
@@ -182,6 +184,185 @@
  | `test_compose_lab_manager_response_reports_non_lab_issues` | Policy-only → REPORT |
  | `test_compose_lab_manager_response_uses_custom_renderer_without_changing_verdict` | Custom renderer works |
 
  ## Test Helpers
 
  ### Shared fixtures in test files
@@ -192,3 +373,13 @@
  | `_base_observation(**overrides)` | test_scientist_policy | Build ScientistObservation with defaults |
  | `_make_system_prompt()` | test_scientist_policy | Build prompt from math_reasoning scenario |
  | `_VALID_REQUEST_INFO_JSON` | test_scientist_policy | Valid request_info JSON string |
 
  # Tests Map — `tests/`
 
+ > 231 tests across 10 files. All passing.
  >
+ > **Last verified:** 2026-03-08
 
  ## Summary
 
  | `test_validation.py` | 13 | Protocol validation checks |
  | `test_scientist_policy.py` | 18+ | Parser, retry, formatter, baseline, bounded tools |
  | `test_lab_manager_policy.py` | 13 | Feasibility, suggestion, response |
+ | `test_reward.py` | 26 | JDG 01-05 scoring functions |
+ | `test_env.py` | 36 | ENV 01-08, JDG 04-05, TST 01-03 |
+ | `test_server.py` | 34 | API endpoint integration (API 02-04, 06-07, 13) |
+ | `test_client.py` | 24 | TRN 13 client module (REST + WS transports) |
+ | **Total** | **231** | |
 
  ## Missing Coverage (not yet implemented)
 
  | File (planned) | Would cover |
  |---------------|-------------|
+ | `test_env.py` (expand) | ENV 10 full reset/step/replay tests |
 
  ---
 
  | `test_compose_lab_manager_response_reports_non_lab_issues` | Policy-only → REPORT |
  | `test_compose_lab_manager_response_uses_custom_renderer_without_changing_verdict` | Custom renderer works |
 
+ ## `test_reward.py` (18 tests)
+
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_rigor_good_protocol_scores_higher_than_bad` | Quality ordering |
+ | `test_rigor_is_deterministic` | Same inputs → same output |
+ | `test_rigor_empty_controls_reduces_score` | Controls matter |
+ | `test_rigor_short_rationale_reduces_score` | Rationale length matters |
+ | `test_rigor_all_domains_return_valid_range` | [0,1] across all 9 combinations |
+ | `test_feasibility_viable_protocol_scores_high` | Good protocol > 0.7 |
+ | `test_feasibility_infeasible_protocol_scores_lower` | Bad < good |
+ | `test_feasibility_accepts_precomputed_check` | Pre-computed = computed |
+ | `test_feasibility_is_deterministic` | Same inputs → same output |
+ | `test_feasibility_partial_credit_for_near_budget` | Slightly over > far over |
+ | `test_feasibility_all_domains_return_valid_range` | [0,1] across all 9 combinations |
+ | `test_fidelity_aligned_protocol_scores_higher` | Aligned > misaligned |
+ | `test_fidelity_is_deterministic` | Same inputs → same output |
+ | `test_fidelity_substitution_gets_partial_credit` | Sub > miss |
+ | `test_fidelity_mentioning_target_metric_improves_score` | Metric mention helps |
+ | `test_fidelity_all_domains_return_valid_range` | [0,1] across all 9 combinations |
+ | `test_all_scores_between_zero_and_one_for_bad_protocol` | Bounds check |
+ | `test_good_protocol_dominates_bad_on_rigor_and_fidelity` | Cross-scorer consistency |
+
+ ## `test_env.py` (32 tests)
+
+ ### TST 01 — Reset (8 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_reset_returns_observation_with_both_roles` | Both scientist + lab_manager present |
+ | `test_reset_scientist_fields_populated` | Paper title, hypothesis, goal, round 0 |
+ | `test_reset_lab_manager_fields_populated` | Budget, staff, time limit populated |
+ | `test_reset_preserves_booked_and_out_of_stock` | ENV 02 scenario-pack data preserved |
+ | `test_reset_state_round_zero` | State starts at round 0, not done |
+ | `test_reset_generates_episode_id` | UUID episode ID generated |
+ | `test_reset_clears_previous_episode` | Second reset clears first episode |
+ | `test_reset_all_templates_and_difficulties` | All 9 template/difficulty combos work |
+
+ ### TST 03 — Invalid Action (4 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_invalid_duration_returns_error_string` | Validation error returned |
+ | `test_env_survives_after_invalid_action` | Env still accepts valid actions after error |
+ | `test_invalid_action_does_not_advance_round` | Round stays at 0 |
+ | `test_request_info_always_passes_validation` | Non-proposal actions skip validation |
+
+ ### TST 02 — Step and Terminal Path (8 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_step_advances_round_number` | Round increments |
+ | `test_step_returns_observations` | Both roles in step result |
+ | `test_step_records_conversation_history` | Scientist + LM entries logged |
+ | `test_accept_with_protocol_terminates` | Accept → done=True |
+ | `test_accept_terminal_step_has_real_reward` | ENV 06 real scores, not stub 0.8 |
+ | `test_max_rounds_terminates` | Max rounds → done, no agreement |
+ | `test_step_info_has_round_and_episode_id` | Metadata populated |
+ | `test_full_episode_propose_then_accept` | Full 2-step episode |
+
+ ### ENV 07 — State Snapshot (2 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_state_is_deep_copy` | Mutating snapshot doesn't affect env |
+ | `test_state_history_is_independent` | History list is independent copy |
+
+ ### ENV 08 — Close/Reopen (3 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_close_is_idempotent` | Double close doesn't throw |
+ | `test_step_after_close_raises` | RuntimeError on step after close |
+ | `test_reset_reopens_closed_env` | Reset clears closed state |
+
+ ### JDG 04-05 — Rubric (7 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_compute_total_reward_formula` | 10×r×f×fi + bonuses = expected |
+ | `test_compute_total_reward_with_penalties` | Penalties subtracted correctly |
+ | `test_compute_total_reward_zero_scores` | Zero dimension → zero reward |
+ | `test_build_reward_breakdown_returns_valid_scores` | All sub-scores in [0,1] |
+ | `test_build_reward_breakdown_efficiency_bonus` | Fewer rounds → higher bonus |
+ | `test_build_reward_breakdown_is_deterministic` | Same inputs → same output |
+ | `test_total_reward_matches_manual_calculation` | Cross-check formula |
+
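The formula these rubric tests cross-check (`10×r×f×fi` plus bonuses, minus penalties) can be sketched as follows. The flat-argument signature is illustrative only; per the function table in the reward map, the real `compute_total_reward` takes a `RewardBreakdown` object.

```python
# Sketch of the total-reward formula exercised by the JDG 04-05 tests:
# 10 × rigor × feasibility × fidelity, plus bonuses, minus penalties.
# Flat arguments are illustrative; the real function takes a RewardBreakdown.
def compute_total_reward(rigor: float, feasibility: float, fidelity: float,
                         efficiency_bonus: float = 0.0,
                         penalties: float = 0.0) -> float:
    return 10.0 * rigor * feasibility * fidelity + efficiency_bonus - penalties
```

Because the three dimensions multiply, any single zero score zeroes the product term, which is exactly what `test_compute_total_reward_zero_scores` asserts.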
+ ## `test_server.py` (34 tests)
+
+ ### GET /scenarios — API 04 (5 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_returns_200` | Endpoint returns 200 |
+ | `test_response_has_scenarios_key` | Response has `scenarios` list |
+ | `test_all_families_present` | All 3 families present |
+ | `test_each_family_has_difficulties` | Each has easy/medium/hard |
+ | `test_no_extra_keys` | Only `family` and `difficulties` keys |
+
+ ### CORS — API 13 (3 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_preflight_allows_localhost_vite_origin` | localhost:5173 allowed |
+ | `test_preflight_allows_hf_space_origin` | HF Spaces origin allowed |
+ | `test_preflight_rejects_unconfigured_origin` | Unknown origin → 400 |
+
+ ### POST /reset — API 02 (7 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_reset_returns_200_with_expected_keys` | 200 with session_id, episode_id, observation |
+ | `test_reset_observation_has_both_roles` | Scientist + lab_manager present |
+ | `test_reset_with_explicit_session_id_reuses_slot` | Same session_id reused |
+ | `test_reset_reuse_closes_prior_env` | New episode on reuse |
+ | `test_reset_default_params` | Defaults work without error |
+ | `test_reset_custom_scenario_and_difficulty` | All 9 combos succeed |
+ | `test_reset_deterministic_with_same_seed` | Same seed → same observation |
+
+ ### POST /step — API 03 (5 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_reset_then_step_happy_path` | Reset → step returns 200 with StepResult |
+ | `test_step_invalid_session_returns_404` | Non-existent session → 404 |
+ | `test_terminal_step_returns_real_reward_breakdown` | Accept has real scores, not stub 0.8 |
+ | `test_semantic_invalid_action_returns_200_with_error` | Invalid duration → 200 with info.error |
+ | `test_replay_uses_real_judge_data` | Replay has real judge_notes, not stub |
+
+ ### WebSocket — API 06 (12 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_ws_ping_pong` | Ping → pong |
+ | `test_ws_reset_returns_observation` | Reset returns episode_id + observation |
+ | `test_ws_step_returns_result` | Step returns step_ok with result |
+ | `test_ws_full_episode_real_reward` | Propose → accept returns real scores |
+ | `test_ws_invalid_json` | Bad JSON → error |
+ | `test_ws_missing_action_field` | Missing action → error |
+ | `test_ws_invalid_action_payload` | Invalid action schema → error |
+ | `test_ws_unknown_message_type` | Unknown type → error |
+ | `test_ws_session_isolation` | Two connections have independent env state |
+ | `test_ws_semantic_invalid_action_returns_step_ok_with_info_error` | Invalid duration → step_ok with info.error |
+ | `test_ws_timeout_verdict` | Max rounds → done, timeout verdict |
+ | `test_ws_terminal_episode_persists_real_replay_log` | WS episode → /replay has real data |
+
+ ### WebSocket Idle Timeout — API 07 (2 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_ws_idle_timeout_closes_connection` | No messages → server closes with code 1000 |
+ | `test_ws_env_closes_on_disconnect` | env.close() called in finally block on disconnect |
+
+ ## `test_client.py` (24 tests)
+
+ ### REST Transport (10 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_connect_succeeds` | REST connect hits /health |
+ | `test_connect_bad_url_raises` | Bad URL raises |
+ | `test_reset_returns_observation` | reset() returns typed Observation |
+ | `test_reset_sets_session_and_episode_id` | IDs set after reset |
+ | `test_reset_reuses_session` | Same session_id on re-reset |
+ | `test_step_returns_step_result` | step() returns typed StepResult |
+ | `test_step_before_reset_raises` | step() without reset raises |
+ | `test_full_episode_propose_accept` | Full episode with reward > 0 |
+ | `test_replay_after_episode` | replay() returns typed EpisodeLog |
+ | `test_context_manager_closes` | `with` block sets connected=False |
+
+ ### WebSocket Transport (11 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_connect_succeeds` | WS connect opens connection |
+ | `test_connect_bad_url_raises` | Bad URL raises |
+ | `test_reset_returns_observation` | reset() returns typed Observation |
+ | `test_reset_sets_episode_id` | episode_id set after reset |
+ | `test_ws_session_id_is_none` | WS has no session_id |
+ | `test_step_returns_step_result` | step() returns typed StepResult |
+ | `test_full_episode_propose_accept` | Full episode with reward > 0 |
+ | `test_semantic_invalid_action_step_ok_with_error` | Invalid action → info.error |
+ | `test_context_manager_closes` | `with` block sets connected=False |
+ | `test_state_not_supported` | state() raises NotImplementedError |
+ | `test_replay_not_supported` | replay() raises NotImplementedError |
+
+ ### Constructor (3 tests)
+ | Test | What it verifies |
+ |------|-----------------|
+ | `test_unknown_transport_raises` | "grpc" → ValueError |
+ | `test_not_connected_raises_on_reset` | reset() without connect raises |
+ | `test_default_transport_is_websocket` | Default is _WsTransport |
+
  ## Test Helpers
 
  ### Shared fixtures in test files
 
  | `_base_observation(**overrides)` | test_scientist_policy | Build ScientistObservation with defaults |
  | `_make_system_prompt()` | test_scientist_policy | Build prompt from math_reasoning scenario |
  | `_VALID_REQUEST_INFO_JSON` | test_scientist_policy | Valid request_info JSON string |
+ | `_scenario(template, difficulty)` | test_env | Generate scenario with seed=42 |
+ | `_good_action(scenario)` | test_env | Build valid propose_protocol action |
+ | `_accept_action()` | test_env | Build valid accept action |
+ | `_good_protocol(scenario)` | test_env | Build well-formed protocol |
+ | `_reset(client, **kwargs)` | test_server | Reset and return response JSON |
+ | `_good_action_payload(client)` | test_server | Build valid propose_protocol payload |
+ | `_accept_action_payload()` | test_server | Build valid accept payload |
+ | `_propose_action(obs)` | test_client | Build valid propose_protocol ScientistAction |
+ | `_accept_action()` | test_client | Build valid accept ScientistAction |
+ | `live_server` | test_client | Module-scoped uvicorn server fixture |
@@ -28,7 +28,7 @@ curl http://localhost:7860/health
28
 
29
  curl -X POST http://localhost:7860/reset \
30
  -H "Content-Type: application/json" \
31
- -d '{"seed": 42, "scenario": "cell_biology", "difficulty": "easy"}'
32
  ```
33
 
34
  ---
@@ -40,6 +40,30 @@ docker build -f server/Dockerfile -t replicalab .
40
  docker run -p 7860:7860 replicalab
41
  ```
42
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
43
  With optional hosted-model secrets:
44
 
45
  ```bash
@@ -50,27 +74,138 @@ docker run -p 7860:7860 \
50
 
51
  ---
52
 
53
- ## Hosted Space Deployment
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
54
 
55
- The repository is not yet marked as fully deployed. Use this section as the deployment checklist for the later API 09, API 10, API 15, and API 17 tasks.
 
 
 
 
 
 
 
 
 
 
 
56
 
57
- ### One-time setup
 
58
 
59
- 1. Create a Space with Docker support.
60
- 2. Add the Space as a remote.
61
- 3. Push the repository once the Docker path and README metadata are finalized.
62
- 4. Verify `/health`, `/reset`, and `/ws` after the Space build finishes.
63
 
64
- ### Secrets checklist
 
 
 
 
 
 
 
 
 
 
 
 
 
65
 
66
- If the deployed server needs hosted-model credentials later, set them in the platform secret store rather than committing them to the repo.
 
 
67
 
68
- Suggested secret names:
69
 
70
- | Secret name | Purpose |
71
- |-------------|---------|
72
- | `MODEL_API_KEY` | Hosted model access key |
73
- | `MODEL_BASE_URL` | Optional alternate provider endpoint |
 
 
 
74
 
75
  ---
76
 
@@ -105,7 +240,7 @@ When hosted deployment is eventually verified:
105
 
106
  | Issue | Fix |
107
  |-------|-----|
108
- | `ReplicaLabEnv not found` warning at startup | Normal while the real env implementation has not landed; the server will use the stub env |
109
  | Docker build fails | Re-check `server/requirements.txt` and the Docker build context |
110
  | CORS error from the frontend | Re-check allowed origins in `server/app.py` |
111
  | WebSocket closes after idle time | Send periodic ping messages or reconnect |
 
 
  curl -X POST http://localhost:7860/reset \
    -H "Content-Type: application/json" \
+   -d '{"seed": 42, "scenario": "math_reasoning", "difficulty": "easy"}'
  ```
 
  ---
 
  docker run -p 7860:7860 replicalab
  ```
 
+ ### Verified endpoints (API 08 sign-off, 2026-03-08)
+
+ After `docker run -p 7860:7860 replicalab`, the following were verified
+ against the **real env** (not the stub):
+
+ ```bash
+ curl http://localhost:7860/health
+ # → {"status":"ok","env":"real"}
+
+ curl http://localhost:7860/scenarios
+ # → {"scenarios":[{"family":"math_reasoning",...}, ...]}
+
+ curl -X POST http://localhost:7860/reset \
+   -H "Content-Type: application/json" \
+   -d '{"seed":42,"scenario":"math_reasoning","difficulty":"easy"}'
+ # → {"session_id":"...","episode_id":"...","observation":{...}}
+
+ # Use session_id from the reset response:
+ curl -X POST http://localhost:7860/step \
+   -H "Content-Type: application/json" \
+   -d '{"session_id":"<SESSION_ID>","action":{"action_type":"propose_protocol","sample_size":3,"controls":["baseline"],"technique":"algebraic_proof","duration_days":1,"required_equipment":[],"required_reagents":[],"questions":[],"rationale":"Test."}}'
+ # → {"observation":{...},"reward":0.0,"done":false,"info":{...}}
+ ```
+
  With optional hosted-model secrets:
 
  ```bash
 
  ---
 
+ ## Hugging Face Spaces Deployment
+
+ ### What is already configured (API 09)
+
+ The repo is now deployment-ready for HF Spaces:
+
+ - **Root `Dockerfile`** — HF Spaces requires the Dockerfile at the repo root.
+   The root-level `Dockerfile` is identical to `server/Dockerfile`. Keep them
+   in sync, or delete `server/Dockerfile` once the team standardizes.
+ - **`README.md` frontmatter** — The root README now contains the required
+   YAML frontmatter that HF Spaces parses on push:
+   ```yaml
+   ---
+   title: ReplicaLab
+   emoji: 🧪
+   colorFrom: blue
+   colorTo: green
+   sdk: docker
+   app_port: 7860
+   pinned: false
+   ---
+   ```
+ - **Non-root user** — The Dockerfile creates and runs as `appuser` (UID 1000),
+   which HF Spaces requires for security.
+ - **Port 7860** — Both the `EXPOSE` directive and the `uvicorn` CMD use 7860,
+   matching the `app_port` in the frontmatter.
+
+ ### Step-by-step deployment (for Max)
+
+ #### 1. Create the Space
+
+ 1. Go to https://huggingface.co/new-space
+ 2. Fill in:
+    - **Owner:** your HF username or the team org
+    - **Space name:** `replicalab` (or `replicalab-demo`)
+    - **License:** MIT
+    - **SDK:** Docker
+    - **Hardware:** CPU Basic (the free tier is fine for the server)
+    - **Visibility:** Public
+ 3. Click **Create Space**
+
+ #### 2. Add the Space as a git remote
+
+ ```bash
+ # From the repo root
+ git remote add hf https://huggingface.co/spaces/<YOUR_HF_USERNAME>/replicalab
+
+ # If the org is different:
+ # git remote add hf https://huggingface.co/spaces/<ORG>/replicalab
+ ```
+
+ #### 3. Push the repo
+
+ ```bash
+ # Push the current branch to the Space
+ git push hf ayush:main
+
+ # Or if deploying from master:
+ # git push hf master:main
+ ```
+
+ HF Spaces will automatically detect the `Dockerfile`, build the image, and
+ start the container.
+
+ #### 4. Monitor the build
+
+ 1. Go to https://huggingface.co/spaces/\<YOUR_HF_USERNAME\>/replicalab
+ 2. Click the **Logs** tab (or the **Build** tab during the first deploy)
+ 3. Wait for the build to complete (typically 2-5 minutes)
+ 4. The Space status should change from "Building" to "Running"
+
+ #### 5. Verify the deployment (API 10 scope)
+
+ Once the Space is running:
+
+ ```bash
+ # Health check
+ curl https://<space-name>.hf.space/health
+
+ # Reset an episode
+ curl -X POST https://<space-name>.hf.space/reset \
+   -H "Content-Type: application/json" \
+   -d '{"seed": 42, "scenario": "math_reasoning", "difficulty": "easy"}'
 
+ # List scenarios
+ curl https://<space-name>.hf.space/scenarios
+ ```
+
+ WebSocket test (using websocat or wscat):
+ ```bash
+ wscat -c wss://<space-name>.hf.space/ws
+ # Then type: {"type": "ping"}
+ # Expect: {"type": "pong"}
+ ```
+
+ ### Secrets configuration
 
+ If the deployed server needs hosted-model credentials later (e.g. for a
+ frontier evaluator), set them in the HF Space secret store:
 
+ 1. Go to the Space **Settings** tab
+ 2. Scroll to **Repository secrets**
+ 3. Add each secret:
 
+ | Secret name | Purpose | Required now? |
+ |-------------|---------|---------------|
+ | `MODEL_API_KEY` | Hosted model access key (for frontier evaluator) | No — only for the demo-time evaluator |
+ | `MODEL_BASE_URL` | Optional alternate provider endpoint | No |
+
+ Secrets are injected as environment variables at container runtime.
+ Access them in Python with `os.environ.get("MODEL_API_KEY")`.
+
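A minimal sketch of reading these optional secrets at startup. The secret names come from the table above; the empty-string default for `MODEL_BASE_URL` and the `hosted_model_enabled` helper are illustrative assumptions, not the server's actual config code.

```python
import os

# Optional hosted-model secrets, injected by the Space at container runtime.
# The "" default for MODEL_BASE_URL is illustrative, not the real default.
MODEL_API_KEY = os.environ.get("MODEL_API_KEY")        # None if unset
MODEL_BASE_URL = os.environ.get("MODEL_BASE_URL", "")  # optional override

def hosted_model_enabled() -> bool:
    # The hosted evaluator path should only activate when a key is present.
    return MODEL_API_KEY is not None
```

Reading with `os.environ.get` (rather than `os.environ[...]`) keeps the server bootable when the secrets are absent, matching the "Required now? No" column above.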
+ ### Re-deploying after code changes
+
+ ```bash
+ # Just push again — HF rebuilds automatically
+ git push hf ayush:main
+ ```
+
+ To force a full rebuild (e.g. after dependency changes):
+
+ 1. Go to Space **Settings**
+ 2. Click **Factory reboot** under the Danger zone section
+
+ ### Known limitations
+
+ - **Free CPU tier** — 2 vCPU and 16 GB RAM. This is sufficient for the
+   FastAPI server but NOT for running RL training. Training happens in Colab.
+ - **Cold starts** — Free-tier Spaces sleep after 48 hours of inactivity.
+   The first request after sleep takes 30-60 seconds while the Space restarts.
+ - **Persistent storage** — Episode replays and logs are in-memory only.
+   They reset when the container restarts. This is acceptable for the
+   hackathon demo.
 
  ---
 
  | Issue | Fix |
  |-------|-----|
+ | `ReplicaLabEnv not found` warning at startup | The real env is now available; ensure `replicalab/scoring/rubric.py` is present and `httpx` + `websocket-client` are in `server/requirements.txt` |
  | Docker build fails | Re-check `server/requirements.txt` and the Docker build context |
  | CORS error from the frontend | Re-check allowed origins in `server/app.py` |
  | WebSocket closes after idle time | Send periodic ping messages or reconnect |
docs/max/task_breakdown.md CHANGED
@@ -10,29 +10,34 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
  - Those tasks were executed by `Person B (Ayush)` and logged in `docs/changes.md`
  - `FND 03` and `FND 12` are now complete via the validated frontend import from Kush's branch onto `ayush`
  - `FND 11` is now complete and verified
- - A normalized backend import from Max's PR is on `ayush`: `server/app.py`, `server/Dockerfile`, and `docs/max/deployment.md`
- - That backend import is intentionally tracked as partial because it still runs on the stub env and Docker has not yet been validated locally
- - Max's remaining implementation priority is the real-env-backed API and deployment path
 
  ---
 
  ## Unblocked now
 
- 1. Convert the stub-backed API tasks to real-env-backed implementations once Kian lands the environment work
- 2. Validate Docker locally once the real env path is in place
 
  ---
 
  ## Still blocked
 
  - `FND 13` is now unblocked because `FND 03` is complete, but it remains owned by Kush (Person D)
- - Real completion of `API 01`, `API 02`, `API 03`, `API 06`, and `API 07` depends on Kian's environment tasks
- - Real completion of `API 08` depends on local Docker build and run validation
 
  ---
 
  ## Recommended execution order
 
- 1. Re-validate the imported server scaffold against Kian's environment implementation
- 2. Validate `server/Dockerfile` locally
- 3. Continue into deployment and replay work once the real env path is stable
 
  - Those tasks were executed by `Person B (Ayush)` and logged in `docs/changes.md`
  - `FND 03` and `FND 12` are now complete via the validated frontend import from Kush's branch onto `ayush`
  - `FND 11` is now complete and verified
+ - The backend path is now real-env-backed locally: `server/app.py` imports `ReplicaLabEnv`, `openenv validate` passes, and local Docker verification is complete
+ - `API 02`, `API 03`, `API 04`, `API 06`, `API 07`, `API 08`, `API 09`, `API 13`, and `API 15` are complete
+ - `API 01`, `API 14`, and `OBS 02` are the remaining partial tasks in Max's lane
+ - Max's remaining implementation priority is live Space deployment, replay persistence, observability, and the remaining API polish
 
  ---
 
  ## Unblocked now
 
+ 1. `API 10` is now unblocked because HF metadata and deployment instructions are in place
+ 2. `API 17` is now unblocked because `API 09` is complete
+ 3. Replay and persistence work (`MOD 07`, `ENV 09`, `API 05`, `JDG 07`) is now the next infra-heavy backend chain
 
  ---
 
  ## Still blocked
 
  - `FND 13` is now unblocked because `FND 03` is complete, but it remains owned by Kush (Person D)
+ - `API 05` still depends on `ENV 09`
+ - `API 16` still depends on `UI 10`
+ - `API 18` still depends on `API 05` and `ENV 11`
+ - `API 19` still depends on `API 10`
 
  ---
 
  ## Recommended execution order
 
+ 1. Ship the `API 10` live Space deployment verification
+ 2. Ship the `API 17` secrets and hosted-key documentation
+ 3. Finish the `API 01`, `API 14`, and `OBS 02` sign-off work
+ 4. Move into replay persistence (`MOD 07` → `ENV 09` → `API 05`)
docs/max/task_list.md CHANGED
@@ -6,22 +6,20 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`

  ## Current status

- - `FND 01`, `FND 02`, `FND 05`, `FND 07`, and `FND 10` are complete
- - All five were executed by `Person B (Ayush)` and recorded as executor deviations
- - `FND 03` is complete via the validated frontend import from Kush's branch onto `ayush`
- - `FND 11` is complete
- - `FND 12` is complete via the imported and validated `frontend/vite.config.ts`
- - A stub-backed backend server scaffold now exists in `server/app.py`
- - `API 01`, `API 02`, `API 03`, `API 04`, `API 06`, `API 07`, `API 08`, `API 13`, `API 14`, and `OBS 02` are partial pending real-env and Docker-level verification
- - The remaining Max work is now the API, Docker, deployment, replay, and observability path
+ - `FND 01`, `FND 02`, `FND 03`, `FND 05`, `FND 07`, `FND 10`, `FND 11`, and `FND 12` are complete
+ - The server now runs against the real `ReplicaLabEnv`, not just the legacy stub fallback
+ - `API 02`, `API 03`, `API 04`, `API 06`, `API 07`, `API 08`, `API 09`, `API 13`, and `API 15` are complete
+ - `API 01`, `API 14`, and `OBS 02` remain partial
+ - The remaining Max work is now live deployment verification, replay or persistence work, observability, and the rest of the API or packaging path

  ---

  ## Immediate next tasks

- - [ ] **API 01 / API 02 / API 03 / API 06** | Convert the stub-backed server scaffold into real-env-backed endpoints | Depends: `ENV 01`, `ENV 02`, `ENV 06` | Status: partial
- - [ ] **API 08** | Validate Docker locally for the server image | Depends: `API 01` to `API 07` | Status: partial
- - [ ] **OBS 02** | Confirm logging behavior against the integrated environment path | Depends: `API 01` | Status: partial
+ - [ ] **API 10** | Deploy the live HF Space and verify `/health`, `/reset`, and `/step` end to end | Depends: `API 09`
+ - [ ] **API 17** | Document secrets and API key management for HF Space and Colab | Depends: `API 09`
+ - [ ] **API 01 / API 14 / OBS 02** | Finish the remaining partial server tasks and sign-offs | Depends: real-env server path already present
+ - [ ] **MOD 07 / ENV 09 / API 05** | Finish replay persistence and replay retrieval path | Depends: `MOD 04`, `ENV 06`

  ---

@@ -38,4 +36,14 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
  ## Completed in Max's lane

  - [x] **FND 11** | Completed and verified in `server/requirements.txt`
+ - [x] **API 02** | Completed by Person B (Ayush)
+ - [x] **API 03** | Completed by Person B (Ayush)
+ - [x] **API 04** | Completed by Person B (Ayush)
+ - [x] **API 06** | Completed by Person B (Ayush)
+ - [x] **API 07** | Completed by Person B (Ayush)
+ - [x] **API 08** | Completed by Person B (Ayush)
+ - [x] **API 09** | Completed by Person B (Ayush)
+ - [x] **API 13** | Completed by Person B (Ayush)
+ - [x] **API 15** | Completed by Person B (Ayush)
+ - [x] **TST 07** | Completed by Person B (Ayush)

replicalab/__init__.py CHANGED
@@ -0,0 +1,3 @@
+ from replicalab.client import ReplicaLabClient
+
+ __all__ = ["ReplicaLabClient"]
replicalab/agents/scientist_policy.py CHANGED
@@ -5,7 +5,7 @@ MOD 09 introduced strict parsing from raw model output into
  builder so prompt assembly can be driven by the normalized scenario pack
  instead of hard-coded domain text. AGT 02 adds the per-turn observation
  formatter that converts a ``ScientistObservation`` into the user message
- sent to the LLM each round. AGT 03 wraps the formatter and parser in a
+ sent to the model each round. AGT 03 wraps the formatter and parser in a
  retry loop with error-specific correction prompts and exposed telemetry.
  AGT 04 adds a deterministic baseline Scientist so smoke tests can run
  without a trained model.
@@ -139,7 +139,7 @@ def call_scientist_with_retry(
      *,
      max_retries: int = 2,
  ) -> ScientistCallResult:
-     """Call an LLM to produce a ``ScientistAction`` with parser-driven retries.
+     """Call a model backend to produce a ``ScientistAction`` with parser-driven retries.

      On parse failure the error is fed back to the model as a correction
      prompt and the model is asked to try again, up to *max_retries* times.
@@ -279,6 +279,18 @@ def build_scientist_system_prompt(
          "For accept, questions must be empty and protocol-edit fields must stay "
          "empty or zero."
      ),
+     (
+         "Bounded tool policy: you have access to three bounded tools. "
+         "search_evidence retrieves supporting facts from frozen evidence packs. "
+         "run_code_check performs bounded code analysis, config validation, and "
+         "derived-value computation. "
+         "inspect_image extracts information from figures, tables, charts, and "
+         "screenshots. "
+         "Rules: use tools only to support or verify claims within the current "
+         "scenario constraints. Tools do not override constraints, loosen limits, "
+         "or reveal hidden ground truth. No unrestricted web browsing. No audio "
+         "capabilities. No autonomous code execution beyond bounded analysis."
+     ),
  ]

  return "\n\n".join(section for section in sections if section)
@@ -344,7 +356,7 @@ def format_scientist_observation(obs: ScientistObservation) -> str:
  def build_baseline_scientist_action(
      observation: ScientistObservation,
  ) -> ScientistAction:
-     """Return a deterministic non-LLM Scientist action for smoke tests.
+     """Return a deterministic baseline Scientist action for smoke tests.

      The baseline follows a conservative policy:
      - propose a valid protocol when no protocol exists yet
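The diff above appends a new entry to the `sections` list that `build_scientist_system_prompt` joins; the join itself drops empty entries and separates the rest with one blank line. A minimal standalone sketch of that joining behavior (the helper name `join_sections` is ours, not the repo's):

```python
def join_sections(sections):
    # Mirrors the builder's final line: falsy sections (empty strings,
    # None) are dropped, the rest are separated by one blank line.
    return "\n\n".join(section for section in sections if section)

prompt = join_sections([
    "You are the Scientist.",
    "",  # skipped: empty section
    "Bounded tool policy: use tools only to verify claims.",
])
print(prompt)
# You are the Scientist.
#
# Bounded tool policy: use tools only to verify claims.
```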
replicalab/client.py ADDED
@@ -0,0 +1,333 @@
+ """Reusable environment client for ReplicaLab (TRN 13).
+
+ Wraps both REST and WebSocket server transports behind a unified
+ sync interface. Consumers (notebook, training loop, eval scripts)
+ import this module instead of duplicating connection logic.
+
+ Usage::
+
+     from replicalab.client import ReplicaLabClient
+
+     with ReplicaLabClient("http://localhost:7860", transport="websocket") as client:
+         obs = client.reset(seed=42, scenario="math_reasoning", difficulty="easy")
+         while True:
+             action = policy(obs)
+             result = client.step(action)
+             obs = result.observation
+             if result.done:
+                 break
+ """
+
+ from __future__ import annotations
+
+ import json
+ import threading
+ from typing import Optional
+
+ import httpx
+ import websocket as ws_lib  # websocket-client
+
+ from replicalab.config import (
+     API_PORT,
+     DEFAULT_DIFFICULTY,
+     DEFAULT_SCENARIO_TEMPLATE,
+ )
+ from replicalab.models import (
+     EpisodeLog,
+     Observation,
+     ScientistAction,
+     StepInfo,
+     StepResult,
+ )
+
+ __all__ = ["ReplicaLabClient"]
+
+ # ---------------------------------------------------------------------------
+ # Transport backends
+ # ---------------------------------------------------------------------------
+
+
+ class _RestTransport:
+     """Sync REST transport using httpx."""
+
+     def __init__(self, base_url: str, timeout: float) -> None:
+         self._base_url = base_url.rstrip("/")
+         self._http = httpx.Client(base_url=self._base_url, timeout=timeout)
+         self._session_id: Optional[str] = None
+         self._episode_id: Optional[str] = None
+
+     # -- lifecycle -----------------------------------------------------------
+
+     def connect(self) -> None:
+         resp = self._http.get("/health")
+         resp.raise_for_status()
+
+     def close(self) -> None:
+         self._session_id = None
+         self._episode_id = None
+         self._http.close()
+
+     # -- env operations ------------------------------------------------------
+
+     def reset(
+         self,
+         seed: int,
+         scenario: str,
+         difficulty: str,
+     ) -> Observation:
+         payload: dict = {
+             "seed": seed,
+             "scenario": scenario,
+             "difficulty": difficulty,
+         }
+         if self._session_id is not None:
+             payload["session_id"] = self._session_id
+
+         resp = self._http.post("/reset", json=payload)
+         resp.raise_for_status()
+         data = resp.json()
+         self._session_id = data["session_id"]
+         self._episode_id = data["episode_id"]
+         return Observation.model_validate(data["observation"])
+
+     def step(self, action: ScientistAction) -> StepResult:
+         if self._session_id is None:
+             raise RuntimeError("Call reset() before step()")
+         resp = self._http.post(
+             "/step",
+             json={
+                 "session_id": self._session_id,
+                 "action": action.model_dump(),
+             },
+         )
+         resp.raise_for_status()
+         return StepResult.model_validate(resp.json())
+
+     def state(self) -> dict:
+         if self._session_id is None:
+             raise RuntimeError("Call reset() before state()")
+         resp = self._http.get(f"/state/{self._session_id}")
+         resp.raise_for_status()
+         return resp.json()
+
+     def replay(self, episode_id: str) -> EpisodeLog:
+         resp = self._http.get(f"/replay/{episode_id}")
+         resp.raise_for_status()
+         return EpisodeLog.model_validate(resp.json())
+
+     # -- properties ----------------------------------------------------------
+
+     @property
+     def session_id(self) -> Optional[str]:
+         return self._session_id
+
+     @property
+     def episode_id(self) -> Optional[str]:
+         return self._episode_id
+
+
+ class _WsTransport:
+     """Sync WebSocket transport using websocket-client."""
+
+     def __init__(self, base_url: str, timeout: float) -> None:
+         # Convert http(s):// → ws(s)://
+         ws_url = base_url.rstrip("/")
+         ws_url = ws_url.replace("https://", "wss://").replace("http://", "ws://")
+         self._ws_url = ws_url + "/ws"
+         self._timeout = timeout
+         self._ws: Optional[ws_lib.WebSocket] = None
+         self._episode_id: Optional[str] = None
+         self._lock = threading.Lock()
+
+     # -- lifecycle -----------------------------------------------------------
+
+     def connect(self) -> None:
+         self._ws = ws_lib.create_connection(
+             self._ws_url, timeout=self._timeout
+         )
+
+     def close(self) -> None:
+         if self._ws is not None:
+             try:
+                 self._ws.close()
+             except Exception:
+                 pass
+         self._ws = None
+         self._episode_id = None
+
+     # -- low-level send/recv -------------------------------------------------
+
+     def _send(self, payload: dict) -> dict:
+         if self._ws is None:
+             raise RuntimeError("Call connect() before sending messages")
+         with self._lock:
+             self._ws.send(json.dumps(payload))
+             raw = self._ws.recv()
+         data = json.loads(raw)
+         if data.get("type") == "error":
+             raise RuntimeError(f"Server error: {data.get('message', '')}")
+         return data
+
+     # -- env operations ------------------------------------------------------
+
+     def reset(
+         self,
+         seed: int,
+         scenario: str,
+         difficulty: str,
+     ) -> Observation:
+         data = self._send({
+             "type": "reset",
+             "seed": seed,
+             "scenario": scenario,
+             "difficulty": difficulty,
+         })
+         if data.get("type") != "reset_ok":
+             raise RuntimeError(f"Unexpected response type: {data.get('type')}")
+         self._episode_id = data.get("episode_id")
+         return Observation.model_validate(data["observation"])
+
+     def step(self, action: ScientistAction) -> StepResult:
+         data = self._send({
+             "type": "step",
+             "action": action.model_dump(),
+         })
+         if data.get("type") != "step_ok":
+             raise RuntimeError(f"Unexpected response type: {data.get('type')}")
+         return StepResult(
+             observation=Observation.model_validate(data["observation"])
+             if data.get("observation")
+             else None,
+             reward=data.get("reward", 0.0),
+             done=data.get("done", False),
+             info=StepInfo.model_validate(data.get("info", {})),
+         )
+
+     def state(self) -> dict:
+         raise NotImplementedError(
+             "state() is not available over WebSocket. Use REST transport or "
+             "track state from step() results."
+         )
+
+     def replay(self, episode_id: str) -> EpisodeLog:
+         raise NotImplementedError(
+             "replay() is not available over WebSocket. Use REST transport or "
+             "a separate httpx call to GET /replay/{episode_id}."
+         )
+
+     # -- properties ----------------------------------------------------------
+
+     @property
+     def session_id(self) -> Optional[str]:
+         return None  # WS sessions are implicit per-connection
+
+     @property
+     def episode_id(self) -> Optional[str]:
+         return self._episode_id
+
+
+ # ---------------------------------------------------------------------------
+ # Public client
+ # ---------------------------------------------------------------------------
+
+
+ class ReplicaLabClient:
+     """Reusable sync client for the ReplicaLab environment server.
+
+     Parameters
+     ----------
+     base_url:
+         Server URL, e.g. ``"http://localhost:7860"``.
+     transport:
+         ``"websocket"`` (default) or ``"rest"``.
+     timeout:
+         Request/connection timeout in seconds.
+     """
+
+     def __init__(
+         self,
+         base_url: str = f"http://localhost:{API_PORT}",
+         *,
+         transport: str = "websocket",
+         timeout: float = 30.0,
+     ) -> None:
+         if transport == "websocket":
+             self._transport: _RestTransport | _WsTransport = _WsTransport(
+                 base_url, timeout
+             )
+         elif transport == "rest":
+             self._transport = _RestTransport(base_url, timeout)
+         else:
+             raise ValueError(f"Unknown transport: {transport!r}. Use 'websocket' or 'rest'.")
+         self._connected = False
+
+     # -- context manager -----------------------------------------------------
+
+     def __enter__(self) -> "ReplicaLabClient":
+         self.connect()
+         return self
+
+     def __exit__(self, *exc) -> None:
+         self.close()
+
+     # -- lifecycle -----------------------------------------------------------
+
+     def connect(self) -> None:
+         """Open the connection to the server."""
+         self._transport.connect()
+         self._connected = True
+
+     def close(self) -> None:
+         """Close the connection and release resources."""
+         self._transport.close()
+         self._connected = False
+
+     # -- env operations ------------------------------------------------------
+
+     def reset(
+         self,
+         seed: int = 0,
+         scenario: str = DEFAULT_SCENARIO_TEMPLATE,
+         difficulty: str = DEFAULT_DIFFICULTY,
+     ) -> Observation:
+         """Start a new episode. Returns the initial observation."""
+         self._ensure_connected()
+         return self._transport.reset(seed, scenario, difficulty)
+
+     def step(self, action: ScientistAction) -> StepResult:
+         """Submit a Scientist action. Returns the step result."""
+         self._ensure_connected()
+         return self._transport.step(action)
+
+     def state(self) -> dict:
+         """Get current episode state (REST only)."""
+         self._ensure_connected()
+         return self._transport.state()
+
+     def replay(self, episode_id: str) -> EpisodeLog:
+         """Fetch a completed episode log (REST only)."""
+         self._ensure_connected()
+         return self._transport.replay(episode_id)
+
+     # -- properties ----------------------------------------------------------
+
+     @property
+     def session_id(self) -> Optional[str]:
+         """REST session ID, or ``None`` for WebSocket transport."""
+         return self._transport.session_id
+
+     @property
+     def episode_id(self) -> Optional[str]:
+         """Current episode ID set after the most recent ``reset()``."""
+         return self._transport.episode_id
+
+     @property
+     def connected(self) -> bool:
+         """Whether ``connect()`` has been called."""
+         return self._connected
+
+     # -- internal ------------------------------------------------------------
+
+     def _ensure_connected(self) -> None:
+         if not self._connected:
+             raise RuntimeError("Client not connected. Call connect() or use as context manager.")
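`_WsTransport.__init__` derives its WebSocket endpoint by rewriting the HTTP scheme and appending `/ws`. That rewrite is pure string manipulation and easy to check in isolation; a standalone sketch of the same conversion (the function name `to_ws_url` is ours, not the repo's):

```python
def to_ws_url(base_url: str) -> str:
    # Mirror of the conversion in _WsTransport.__init__: strip a trailing
    # slash, swap http(s):// for ws(s)://, then append the /ws route.
    url = base_url.rstrip("/")
    url = url.replace("https://", "wss://").replace("http://", "ws://")
    return url + "/ws"

print(to_ws_url("http://localhost:7860"))      # ws://localhost:7860/ws
print(to_ws_url("https://example.hf.space/"))  # wss://example.hf.space/ws
```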
replicalab/scoring/__init__.py CHANGED
@@ -3,8 +3,11 @@
  from .feasibility import score_feasibility
  from .fidelity import score_fidelity
  from .rigor import score_rigor
+ from .rubric import build_reward_breakdown, compute_total_reward

  __all__ = [
+     "build_reward_breakdown",
+     "compute_total_reward",
      "score_feasibility",
      "score_fidelity",
      "score_rigor",
replicalab/scoring/rubric.py ADDED
@@ -0,0 +1,99 @@
+ """JDG 04-05 — Total reward computation and reward breakdown builder.
+
+ Combines rigor (JDG 01), feasibility (JDG 02), and fidelity (JDG 03)
+ into a single scalar reward with efficiency bonus and penalties.
+
+ Formula: total = 10 × rigor × feasibility × fidelity + bonuses − penalties
+
+ Pure deterministic functions — no model calls, no side effects.
+ """
+
+ from __future__ import annotations
+
+ from replicalab.agents.lab_manager_policy import (
+     FeasibilityCheckResult,
+     check_feasibility,
+ )
+ from replicalab.models import Protocol, RewardBreakdown
+ from replicalab.scenarios.templates import NormalizedScenarioPack
+ from replicalab.scoring.feasibility import score_feasibility
+ from replicalab.scoring.fidelity import score_fidelity
+ from replicalab.scoring.rigor import score_rigor
+
+
+ _REWARD_SCALE = 10.0
+ _MAX_EFFICIENCY_BONUS = 1.0
+ _MAX_COMMUNICATION_BONUS = 0.0  # reserved for future use
+
+
+ def compute_total_reward(breakdown: RewardBreakdown) -> float:
+     """Compute the scalar reward from a RewardBreakdown.
+
+     Formula: 10 × rigor × feasibility × fidelity + efficiency_bonus
+     + communication_bonus − sum(penalties)
+     """
+     base = _REWARD_SCALE * breakdown.rigor * breakdown.feasibility * breakdown.fidelity
+     bonus = breakdown.efficiency_bonus + breakdown.communication_bonus
+     penalty = sum(breakdown.penalties.values())
+     return max(0.0, round(base + bonus - penalty, 6))
+
+
+ def build_reward_breakdown(
+     protocol: Protocol,
+     scenario: NormalizedScenarioPack,
+     rounds_used: int,
+     max_rounds: int,
+     *,
+     check: FeasibilityCheckResult | None = None,
+     penalties: dict[str, float] | None = None,
+ ) -> RewardBreakdown:
+     """Build a full RewardBreakdown from the three sub-scores plus bonuses.
+
+     Parameters
+     ----------
+     protocol : Protocol
+         The final agreed protocol.
+     scenario : NormalizedScenarioPack
+         The scenario pack for this episode.
+     rounds_used : int
+         How many rounds were consumed.
+     max_rounds : int
+         The episode's round cap.
+     check : FeasibilityCheckResult, optional
+         Pre-computed feasibility check to avoid redundant work.
+     penalties : dict[str, float], optional
+         Named penalty keys for bounded-tool diagnostics, unsupported
+         evidence claims, or other deterministic deductions. Use named
+         keys (e.g. ``"invalid_tool_use"``, ``"unsupported_claim"``)
+         instead of adding new fields to RewardBreakdown.
+     """
+     if check is None:
+         check = check_feasibility(protocol, scenario)
+
+     rigor = score_rigor(protocol, scenario)
+     feasibility = score_feasibility(protocol, scenario, check=check)
+     fidelity = score_fidelity(protocol, scenario)
+
+     efficiency_bonus = _efficiency_bonus(rounds_used, max_rounds)
+     merged_penalties = dict(penalties) if penalties else {}
+
+     return RewardBreakdown(
+         rigor=rigor,
+         feasibility=feasibility,
+         fidelity=fidelity,
+         efficiency_bonus=efficiency_bonus,
+         communication_bonus=0.0,
+         penalties=merged_penalties,
+     )
+
+
+ def _efficiency_bonus(rounds_used: int, max_rounds: int) -> float:
+     """Reward finishing in fewer rounds.
+
+     If the scientist reaches agreement in round 1 of 6, that's the maximum
+     bonus. If they use all rounds, the bonus is 0.
+     """
+     if max_rounds <= 1 or rounds_used <= 0:
+         return 0.0
+     saved = max(0, max_rounds - rounds_used)
+     return round(_MAX_EFFICIENCY_BONUS * saved / (max_rounds - 1), 6)
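Since the rubric is pure and deterministic, its arithmetic can be checked without the Pydantic models. A standalone sketch of the same formulas, using plain floats instead of `RewardBreakdown` (the function names here are ours):

```python
# Standalone sketch of the rubric from rubric.py:
# total = 10 * rigor * feasibility * fidelity + bonuses - penalties,
# clamped at zero and rounded to 6 decimal places.
def total_reward(rigor, feasibility, fidelity,
                 efficiency_bonus=0.0, communication_bonus=0.0,
                 penalties=None):
    base = 10.0 * rigor * feasibility * fidelity
    bonus = efficiency_bonus + communication_bonus
    penalty = sum((penalties or {}).values())
    return max(0.0, round(base + bonus - penalty, 6))

def efficiency_bonus(rounds_used, max_rounds, max_bonus=1.0):
    # Finishing in round 1 of N earns the full bonus; using every round earns 0.
    if max_rounds <= 1 or rounds_used <= 0:
        return 0.0
    saved = max(0, max_rounds - rounds_used)
    return round(max_bonus * saved / (max_rounds - 1), 6)

# Agreement in round 2 of 6 with strong sub-scores and one named penalty:
bonus = efficiency_bonus(rounds_used=2, max_rounds=6)   # 0.8
reward = total_reward(0.9, 0.8, 0.7, efficiency_bonus=bonus,
                      penalties={"invalid_tool_use": 0.5})
print(bonus, reward)  # 0.8 5.34
```

Note that the multiplicative base means any zero sub-score zeroes the whole base, so bonuses alone can never rescue a plan that fails one axis outright.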
replicalab/utils/logging.py ADDED
@@ -0,0 +1,159 @@
+ """Episode logging and replay persistence helpers.
+
+ MOD 07 provides the persistence boundary for episode replay, notebook
+ inspection, and later API replay retrieval. All writes are atomic
+ (temp file + rename) so a crash never leaves a half-written replay.
+ """
+
+ from __future__ import annotations
+
+ import csv
+ import io
+ import os
+ import tempfile
+ from pathlib import Path
+ from typing import TypeVar
+
+ from pydantic import BaseModel
+
+ from replicalab.models import EpisodeLog
+
+ _M = TypeVar("_M", bound=BaseModel)
+
+ _DEFAULT_REPLAYS_DIR = Path(__file__).resolve().parents[2] / "replicalab" / "outputs" / "replays"
+ _DEFAULT_LOGS_DIR = Path(__file__).resolve().parents[2] / "replicalab" / "outputs" / "logs"
+
+
+ # ---------------------------------------------------------------------------
+ # Internal helper — atomic JSON write for any Pydantic model
+ # ---------------------------------------------------------------------------
+
+
+ def _write_json_model(model: BaseModel, path: Path) -> Path:
+     """Serialize a Pydantic model to *path* atomically.
+
+     Writes to a temporary file in the same directory, then renames so
+     readers never see a partial file.
+     """
+     path = Path(path)
+     path.parent.mkdir(parents=True, exist_ok=True)
+
+     data = model.model_dump_json(indent=2)
+
+     fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
+     try:
+         with os.fdopen(fd, "w", encoding="utf-8") as fh:
+             fh.write(data)
+         # On Windows, target must not exist for os.rename; use replace.
+         os.replace(tmp, str(path))
+     except BaseException:
+         # Clean up the temp file on any failure.
+         try:
+             os.unlink(tmp)
+         except OSError:
+             pass
+         raise
+
+     return path
+
+
+ # ---------------------------------------------------------------------------
+ # Public API
+ # ---------------------------------------------------------------------------
+
+
+ def write_episode_log(
+     log: EpisodeLog,
+     directory: Path | str | None = None,
+ ) -> Path:
+     """Persist a completed episode log as JSON.
+
+     Parameters
+     ----------
+     log:
+         The completed episode record.
+     directory:
+         Target directory. Defaults to ``replicalab/outputs/replays/``.
+
+     Returns
+     -------
+     Path
+         Absolute path to the written file.
+     """
+     directory = Path(directory) if directory is not None else _DEFAULT_REPLAYS_DIR
+     filename = f"{log.episode_id}.json" if log.episode_id else "unknown.json"
+     return _write_json_model(log, directory / filename)
+
+
+ def load_episode_log(path: Path | str) -> EpisodeLog:
+     """Load an episode log from a JSON file.
+
+     Raises
+     ------
+     FileNotFoundError
+         If *path* does not exist.
+     pydantic.ValidationError
+         If the file contents do not match the ``EpisodeLog`` schema.
+     """
+     path = Path(path)
+     raw = path.read_text(encoding="utf-8")
+     return EpisodeLog.model_validate_json(raw)
+
+
+ def append_reward_csv(
+     path: Path | str | None = None,
+     *,
+     episode_id: str = "",
+     seed: int = 0,
+     scenario_template: str = "",
+     difficulty: str = "",
+     total_reward: float = 0.0,
+     rigor: float = 0.0,
+     feasibility: float = 0.0,
+     fidelity: float = 0.0,
+     rounds_used: int = 0,
+     agreement_reached: bool = False,
+ ) -> Path:
+     """Append one row to a reward CSV file.
+
+     Creates the file with a header if it does not exist.
+     Pre-stages the format that JDG 07 will consume.
+     """
+     path = Path(path) if path is not None else _DEFAULT_LOGS_DIR / "rewards.csv"
+     path.parent.mkdir(parents=True, exist_ok=True)
+
+     fieldnames = [
+         "episode_id",
+         "seed",
+         "scenario_template",
+         "difficulty",
+         "total_reward",
+         "rigor",
+         "feasibility",
+         "fidelity",
+         "rounds_used",
+         "agreement_reached",
+     ]
+
+     write_header = not path.exists() or path.stat().st_size == 0
+
+     row = {
+         "episode_id": episode_id,
+         "seed": seed,
+         "scenario_template": scenario_template,
+         "difficulty": difficulty,
+         "total_reward": total_reward,
+         "rigor": rigor,
+         "feasibility": feasibility,
+         "fidelity": fidelity,
+         "rounds_used": rounds_used,
+         "agreement_reached": agreement_reached,
+     }
+
+     with open(path, "a", newline="", encoding="utf-8") as fh:
+         writer = csv.DictWriter(fh, fieldnames=fieldnames)
+         if write_header:
+             writer.writeheader()
+         writer.writerow(row)
+
+     return path
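The temp-file-plus-`os.replace` pattern in `_write_json_model` is independent of Pydantic; the same sketch works for any string payload. A minimal standalone version (the function name `atomic_write_text` is ours):

```python
import os
import tempfile
from pathlib import Path

def atomic_write_text(text: str, path: Path) -> Path:
    # Write to a temp file in the same directory, then rename over the
    # target: readers see either the old file or the new one, never a
    # partial write. os.replace also works when the target already
    # exists, unlike os.rename on Windows.
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
        os.replace(tmp, str(path))
    except BaseException:
        # Remove the orphaned temp file on any failure, then re-raise.
        try:
            os.unlink(tmp)
        except OSError:
            pass
        raise
    return path

with tempfile.TemporaryDirectory() as d:
    target = atomic_write_text('{"episode_id": "ep-1"}', Path(d) / "ep-1.json")
    print(target.read_text(encoding="utf-8"))  # {"episode_id": "ep-1"}
```

Creating the temp file in the target directory (not the system temp dir) matters: `os.replace` is only atomic when source and destination live on the same filesystem.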
server/Dockerfile CHANGED
@@ -17,8 +17,8 @@ COPY replicalab/ ./replicalab/
  COPY server/ ./server/
  COPY pyproject.toml ./

- # Install the replicalab package in editable mode
- RUN pip install --no-cache-dir -e . --no-deps
+ # Install the replicalab package (non-editable, deps already present)
+ RUN pip install --no-cache-dir . --no-deps

  # Run as a non-root user inside the container
  RUN useradd -m -u 1000 appuser && chown -R appuser /app
server/app.py CHANGED
@@ -76,7 +76,7 @@ logging.basicConfig(
 log = logging.getLogger("replicalab.server")
 
 # ---------------------------------------------------------------------------
-# Environment factory — swap _StubEnv for ReplicaLabEnv once Person A ships it
+# Environment factory — prefer ReplicaLabEnv, retain _StubEnv only as fallback
 # ---------------------------------------------------------------------------
 
 try:
@@ -89,21 +89,17 @@ except ImportError:
     log.warning("ReplicaLabEnv not found — using _StubEnv (replace when Person A ships env)")
 
 
-def _reward_breakdown_from_state(state: EpisodeState) -> RewardBreakdown:
-    return RewardBreakdown(
-        rigor=state.rigor_score,
-        feasibility=state.feasibility_score,
-        fidelity=state.fidelity_score,
-        efficiency_bonus=0.0,
-        communication_bonus=0.0,
-        penalties={
-            "invalid_action": 0.0,
-            "timeout": 0.0,
-        },
-    )
-
-
-def _build_episode_log(episode_id: str, state: EpisodeState) -> EpisodeLog:
+def _build_episode_log(
+    episode_id: str,
+    state: EpisodeState,
+    result: StepResult,
+) -> EpisodeLog:
+    """Build an EpisodeLog from the terminal StepResult.
+
+    Uses the real reward_breakdown, judge_notes, and verdict from the env
+    instead of rebuilding from state with stale stub values.
+    """
+    info = result.info
     return EpisodeLog(
         episode_id=episode_id,
         seed=state.seed,
@@ -111,12 +107,12 @@ def _build_episode_log(episode_id: str, state: EpisodeState) -> EpisodeLog:
         difficulty=state.difficulty,
         final_state=state,
         transcript=list(state.conversation_history),
-        reward_breakdown=_reward_breakdown_from_state(state),
-        total_reward=state.reward,
+        reward_breakdown=info.reward_breakdown,
+        total_reward=result.reward,
         rounds_used=state.round_number,
-        agreement_reached=state.agreement_reached,
-        judge_notes="Stub audit until judge integration lands.",
-        verdict="accept" if state.agreement_reached else "revise",
+        agreement_reached=info.agreement_reached,
+        judge_notes=info.judge_notes or "",
+        verdict=info.verdict or "",
     )
@@ -198,7 +194,11 @@ class _StubEnv:
             info=StepInfo(
                 agreement_reached=self._state.agreement_reached,
                 error=None,
-                reward_breakdown=_reward_breakdown_from_state(self._state) if done else None,
+                reward_breakdown=RewardBreakdown(
+                    rigor=self._state.rigor_score,
+                    feasibility=self._state.feasibility_score,
+                    fidelity=self._state.fidelity_score,
+                ) if done else None,
                 judge_notes="Stub audit until judge integration lands." if done else None,
                 verdict=("accept" if self._state.agreement_reached else "revise") if done else None,
                 round=self._state.round_number,
@@ -416,6 +416,10 @@ class ResetResponse(BaseModel):
     observation: Observation
 
 
+class ScenariosResponse(BaseModel):
+    scenarios: list[dict]
+
+
 class StepRequest(BaseModel):
     session_id: str
     action: ScientistAction
@@ -431,9 +435,9 @@ async def health():
     return {"status": "ok", "env": "real" if _HAS_REAL_ENV else "stub"}
 
 
-@app.get("/scenarios")
+@app.get("/scenarios", response_model=ScenariosResponse)
 async def list_scenarios():
-    return {"scenarios": SCENARIOS}
+    return ScenariosResponse(scenarios=SCENARIOS)
 
 
 @app.post("/reset", response_model=ResetResponse)
@@ -478,6 +482,7 @@ async def step_episode(req: StepRequest):
         _replay_store[session["episode_id"]] = _build_episode_log(
             session["episode_id"],
             state,
+            result,
         )
         log.info(
             "Episode done | session=%s episode=%s reward=%.2f",
@@ -596,7 +601,9 @@ async def websocket_endpoint(ws: WebSocket):
     # Store completed episode for REST replay
     if result.done and episode_id:
         state = env.state()
-        _replay_store[episode_id] = _build_episode_log(episode_id, state)
+        _replay_store[episode_id] = _build_episode_log(
+            episode_id, state, result
+        )
 
     await _ws_send(
         ws,
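The `_build_episode_log` refactor above switches the log's reward fields from stale state-derived stubs to the terminal `StepResult`. A minimal stdlib sketch of that pattern, using hypothetical simplified types (not the real Pydantic models):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StepInfo:
    agreement_reached: bool
    judge_notes: Optional[str] = None
    verdict: Optional[str] = None

@dataclass
class StepResult:
    reward: float
    done: bool
    info: StepInfo

def build_episode_log(episode_id: str, result: StepResult) -> dict:
    # Pull judge output from the terminal result, not from stale state.
    info = result.info
    return {
        "episode_id": episode_id,
        "total_reward": result.reward,
        "agreement_reached": info.agreement_reached,
        "judge_notes": info.judge_notes or "",
        "verdict": info.verdict or "",
    }

log = build_episode_log(
    "ep-1",
    StepResult(reward=5.0, done=True,
               info=StepInfo(agreement_reached=True, verdict="accept")),
)
print(log["verdict"])  # accept
```

The `or ""` fallbacks mirror the diff's handling of `None` judge fields on non-terminal results.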
server/requirements.txt CHANGED
@@ -2,3 +2,5 @@ fastapi>=0.115,<1.0
 uvicorn[standard]>=0.34,<1.0
 websockets>=15.0,<17.0
 pydantic>=2.7,<3.0
+httpx>=0.27,<1.0
+websocket-client>=1.7,<2.0
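The two new dependencies back the client's dual transports: `httpx` for the REST endpoints and `websocket-client` for the WS path. A tiny illustrative sketch of the URL split (the `/health`, `/reset`, and `/step` paths come from the server diff; the `/ws` path is a hypothetical placeholder):

```python
def endpoint_urls(host: str, port: int) -> dict:
    """Build REST and WS endpoint URLs for a ReplicaLab server."""
    base = f"http://{host}:{port}"
    return {
        "health": f"{base}/health",   # GET, served by the health() route
        "reset": f"{base}/reset",     # POST ResetRequest
        "step": f"{base}/step",       # POST StepRequest
        "ws": f"ws://{host}:{port}/ws",  # hypothetical WS path
    }

urls = endpoint_urls("127.0.0.1", 7860)
print(urls["health"])  # http://127.0.0.1:7860/health
```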
tests/fixtures/api_schema_examples.json ADDED
@@ -0,0 +1,491 @@
+{
+  "_meta": {
+    "generated_by": "tests/fixtures/generate_api_examples.py",
+    "description": "API schema examples generated from real Pydantic models. Re-run the script to regenerate after contract changes.",
+    "seed": 42,
+    "scenario_template": "math_reasoning",
+    "difficulty": "easy"
+  },
+  "rest": {
+    "POST /reset": {
+      "request": {
+        "seed": 42,
+        "scenario": "math_reasoning",
+        "difficulty": "easy",
+        "session_id": null
+      },
+      "response": {
+        "session_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
+        "episode_id": "ep-deadbeef-1234-5678-9abc-def012345678",
+        "observation": {
+          "scientist": {
+            "paper_title": "Planning a proof of the Cauchy-Schwarz inequality",
+            "paper_hypothesis": "A square-expansion argument gives the cleanest proof path.",
+            "paper_method": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+            "paper_key_finding": "The proof is accepted only if every inequality step and equality case is justified.",
+            "experiment_goal": "Produce a proof-planning workflow for the Cauchy-Schwarz inequality for an undergraduate seminar handout.",
+            "conversation_history": [],
+            "current_protocol": null,
+            "round_number": 0,
+            "max_rounds": 6
+          },
+          "lab_manager": {
+            "budget_total": 345.0,
+            "budget_remaining": 345.0,
+            "equipment_available": [
+              "Structured proof notebook"
+            ],
+            "equipment_booked": [],
+            "reagents_in_stock": [
+              "Reference theorem library",
+              "Graduate reviewer"
+            ],
+            "reagents_out_of_stock": [],
+            "staff_count": 1,
+            "time_limit_days": 3,
+            "safety_restrictions": [
+              "The outline should stay concise enough for seminar notes."
+            ],
+            "conversation_history": [],
+            "current_protocol": null,
+            "round_number": 0,
+            "max_rounds": 6
+          }
+        }
+      }
+    },
+    "POST /step": {
+      "request": {
+        "session_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
+        "action": {
+          "action_type": "propose_protocol",
+          "sample_size": 30,
+          "controls": [
+            "positive_control",
+            "negative_control"
+          ],
+          "technique": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+          "duration_days": 5,
+          "required_equipment": [
+            "Structured proof notebook"
+          ],
+          "required_reagents": [
+            "Reference theorem library",
+            "Graduate reviewer"
+          ],
+          "questions": [],
+          "rationale": "Initial proposal using available resources."
+        }
+      },
+      "response_mid_episode": {
+        "observation": {
+          "scientist": {
+            "paper_title": "Planning a proof of the Cauchy-Schwarz inequality",
+            "paper_hypothesis": "A square-expansion argument gives the cleanest proof path.",
+            "paper_method": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+            "paper_key_finding": "The proof is accepted only if every inequality step and equality case is justified.",
+            "experiment_goal": "Produce a proof-planning workflow for the Cauchy-Schwarz inequality for an undergraduate seminar handout.",
+            "conversation_history": [
+              {
+                "role": "scientist",
+                "message": "Initial proposal using available resources.",
+                "round_number": 1,
+                "action_type": "propose_protocol"
+              },
+              {
+                "role": "lab_manager",
+                "message": "Budget is within range. Equipment is available.",
+                "round_number": 1,
+                "action_type": "report_feasibility"
+              }
+            ],
+            "current_protocol": {
+              "sample_size": 30,
+              "controls": [
+                "positive_control",
+                "negative_control"
+              ],
+              "technique": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+              "duration_days": 5,
+              "required_equipment": [
+                "Structured proof notebook"
+              ],
+              "required_reagents": [
+                "Reference theorem library",
+                "Graduate reviewer"
+              ],
+              "rationale": "Initial proposal using available resources."
+            },
+            "round_number": 1,
+            "max_rounds": 6
+          },
+          "lab_manager": {
+            "budget_total": 345.0,
+            "budget_remaining": 345.0,
+            "equipment_available": [
+              "Structured proof notebook"
+            ],
+            "equipment_booked": [],
+            "reagents_in_stock": [
+              "Reference theorem library",
+              "Graduate reviewer"
+            ],
+            "reagents_out_of_stock": [],
+            "staff_count": 1,
+            "time_limit_days": 3,
+            "safety_restrictions": [
+              "The outline should stay concise enough for seminar notes."
+            ],
+            "conversation_history": [
+              {
+                "role": "scientist",
+                "message": "Initial proposal using available resources.",
+                "round_number": 1,
+                "action_type": "propose_protocol"
+              },
+              {
+                "role": "lab_manager",
+                "message": "Budget is within range. Equipment is available.",
+                "round_number": 1,
+                "action_type": "report_feasibility"
+              }
+            ],
+            "current_protocol": {
+              "sample_size": 30,
+              "controls": [
+                "positive_control",
+                "negative_control"
+              ],
+              "technique": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+              "duration_days": 5,
+              "required_equipment": [
+                "Structured proof notebook"
+              ],
+              "required_reagents": [
+                "Reference theorem library",
+                "Graduate reviewer"
+              ],
+              "rationale": "Initial proposal using available resources."
+            },
+            "round_number": 1,
+            "max_rounds": 6
+          }
+        },
+        "reward": 0.0,
+        "done": false,
+        "info": {
+          "agreement_reached": false,
+          "error": null,
+          "reward_breakdown": null,
+          "judge_notes": null,
+          "verdict": null
+        }
+      },
+      "response_terminal": {
+        "observation": null,
+        "reward": 5.0,
+        "done": true,
+        "info": {
+          "agreement_reached": true,
+          "error": null,
+          "reward_breakdown": {
+            "rigor": 0.8,
+            "feasibility": 0.8,
+            "fidelity": 0.8,
+            "efficiency_bonus": 0.2,
+            "communication_bonus": 0.1,
+            "penalties": {
+              "timeout": 0.0
+            }
+          },
+          "judge_notes": "Stub audit until judge integration lands.",
+          "verdict": "accept"
+        }
+      }
+    },
+    "GET /scenarios": {
+      "response": {
+        "scenarios": [
+          {
+            "family": "math_reasoning",
+            "difficulties": [
+              "easy",
+              "medium",
+              "hard"
+            ]
+          },
+          {
+            "family": "ml_benchmark",
+            "difficulties": [
+              "easy",
+              "medium",
+              "hard"
+            ]
+          },
+          {
+            "family": "finance_trading",
+            "difficulties": [
+              "easy",
+              "medium",
+              "hard"
+            ]
+          }
+        ]
+      }
+    },
+    "GET /replay/{episode_id}": {
+      "response": {
+        "episode_id": "ep-deadbeef-1234-5678-9abc-def012345678",
+        "seed": 42,
+        "scenario_template": "math_reasoning",
+        "difficulty": "easy",
+        "final_state": {
+          "seed": 42,
+          "scenario_template": "math_reasoning",
+          "difficulty": "easy",
+          "paper_title": "Planning a proof of the Cauchy-Schwarz inequality",
+          "paper_hypothesis": "A square-expansion argument gives the cleanest proof path.",
+          "paper_method": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+          "paper_key_finding": "The proof is accepted only if every inequality step and equality case is justified.",
+          "experiment_goal": "Produce a proof-planning workflow for the Cauchy-Schwarz inequality for an undergraduate seminar handout.",
+          "lab_budget_total": 345.0,
+          "lab_budget_remaining": 345.0,
+          "lab_equipment": [
+            "Structured proof notebook"
+          ],
+          "lab_reagents": [
+            "Reference theorem library",
+            "Graduate reviewer"
+          ],
+          "lab_staff_count": 1,
+          "lab_time_limit_days": 3,
+          "current_protocol": null,
+          "conversation_history": [],
+          "round_number": 3,
+          "max_rounds": 6,
+          "done": true,
+          "agreement_reached": true,
+          "reward": 5.0,
+          "rigor_score": 0.8,
+          "feasibility_score": 0.8,
+          "fidelity_score": 0.8
+        },
+        "transcript": [
+          {
+            "role": "scientist",
+            "message": "Initial proposal using available resources.",
+            "round_number": 1,
+            "action_type": "propose_protocol"
+          },
+          {
+            "role": "lab_manager",
+            "message": "Budget is within range. Equipment is available.",
+            "round_number": 1,
+            "action_type": "report_feasibility"
+          }
+        ],
+        "reward_breakdown": {
+          "rigor": 0.8,
+          "feasibility": 0.8,
+          "fidelity": 0.8,
+          "efficiency_bonus": 0.2,
+          "communication_bonus": 0.1,
+          "penalties": {
+            "timeout": 0.0
+          }
+        },
+        "total_reward": 5.0,
+        "rounds_used": 3,
+        "agreement_reached": true,
+        "judge_notes": "Stub audit until judge integration lands.",
+        "verdict": "accept"
+      }
+    }
+  },
+  "websocket": {
+    "reset": {
+      "client_sends": {
+        "type": "reset",
+        "seed": 42,
+        "scenario": "math_reasoning",
+        "difficulty": "easy"
+      },
+      "server_responds": {
+        "type": "reset_ok",
+        "episode_id": "ep-deadbeef-1234-5678-9abc-def012345678",
+        "observation": {
+          "scientist": {
+            "paper_title": "Planning a proof of the Cauchy-Schwarz inequality",
+            "paper_hypothesis": "A square-expansion argument gives the cleanest proof path.",
+            "paper_method": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+            "paper_key_finding": "The proof is accepted only if every inequality step and equality case is justified.",
+            "experiment_goal": "Produce a proof-planning workflow for the Cauchy-Schwarz inequality for an undergraduate seminar handout.",
+            "conversation_history": [],
+            "current_protocol": null,
+            "round_number": 0,
+            "max_rounds": 6
+          },
+          "lab_manager": {
+            "budget_total": 345.0,
+            "budget_remaining": 345.0,
+            "equipment_available": [
+              "Structured proof notebook"
+            ],
+            "equipment_booked": [],
+            "reagents_in_stock": [
+              "Reference theorem library",
+              "Graduate reviewer"
+            ],
+            "reagents_out_of_stock": [],
+            "staff_count": 1,
+            "time_limit_days": 3,
+            "safety_restrictions": [
+              "The outline should stay concise enough for seminar notes."
+            ],
+            "conversation_history": [],
+            "current_protocol": null,
+            "round_number": 0,
+            "max_rounds": 6
+          }
+        }
+      }
+    },
+    "step": {
+      "client_sends": {
+        "type": "step",
+        "action": {
+          "action_type": "propose_protocol",
+          "sample_size": 30,
+          "controls": [
+            "positive_control",
+            "negative_control"
+          ],
+          "technique": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+          "duration_days": 5,
+          "required_equipment": [
+            "Structured proof notebook"
+          ],
+          "required_reagents": [
+            "Reference theorem library",
+            "Graduate reviewer"
+          ],
+          "questions": [],
+          "rationale": "Initial proposal using available resources."
+        }
+      },
+      "server_responds": {
+        "type": "step_ok",
+        "observation": {
+          "scientist": {
+            "paper_title": "Planning a proof of the Cauchy-Schwarz inequality",
+            "paper_hypothesis": "A square-expansion argument gives the cleanest proof path.",
+            "paper_method": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+            "paper_key_finding": "The proof is accepted only if every inequality step and equality case is justified.",
+            "experiment_goal": "Produce a proof-planning workflow for the Cauchy-Schwarz inequality for an undergraduate seminar handout.",
+            "conversation_history": [
+              {
+                "role": "scientist",
+                "message": "Initial proposal using available resources.",
+                "round_number": 1,
+                "action_type": "propose_protocol"
+              },
+              {
+                "role": "lab_manager",
+                "message": "Budget is within range. Equipment is available.",
+                "round_number": 1,
+                "action_type": "report_feasibility"
+              }
+            ],
+            "current_protocol": {
+              "sample_size": 30,
+              "controls": [
+                "positive_control",
+                "negative_control"
+              ],
+              "technique": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+              "duration_days": 5,
+              "required_equipment": [
+                "Structured proof notebook"
+              ],
+              "required_reagents": [
+                "Reference theorem library",
+                "Graduate reviewer"
+              ],
+              "rationale": "Initial proposal using available resources."
+            },
+            "round_number": 1,
+            "max_rounds": 6
+          },
+          "lab_manager": {
+            "budget_total": 345.0,
+            "budget_remaining": 345.0,
+            "equipment_available": [
+              "Structured proof notebook"
+            ],
+            "equipment_booked": [],
+            "reagents_in_stock": [
+              "Reference theorem library",
+              "Graduate reviewer"
+            ],
+            "reagents_out_of_stock": [],
+            "staff_count": 1,
+            "time_limit_days": 3,
+            "safety_restrictions": [
+              "The outline should stay concise enough for seminar notes."
+            ],
+            "conversation_history": [
+              {
+                "role": "scientist",
+                "message": "Initial proposal using available resources.",
+                "round_number": 1,
+                "action_type": "propose_protocol"
+              },
+              {
+                "role": "lab_manager",
+                "message": "Budget is within range. Equipment is available.",
+                "round_number": 1,
+                "action_type": "report_feasibility"
+              }
+            ],
+            "current_protocol": {
+              "sample_size": 30,
+              "controls": [
+                "positive_control",
+                "negative_control"
+              ],
+              "technique": "Outline the proof using one algebraic identity, one equality-case check, and reviewer notes.",
+              "duration_days": 5,
+              "required_equipment": [
+                "Structured proof notebook"
+              ],
+              "required_reagents": [
+                "Reference theorem library",
+                "Graduate reviewer"
+              ],
+              "rationale": "Initial proposal using available resources."
+            },
+            "round_number": 1,
+            "max_rounds": 6
+          }
+        },
+        "reward": 0.0,
+        "done": false,
+        "info": {
+          "agreement_reached": false,
+          "error": null,
+          "reward_breakdown": null,
+          "judge_notes": null,
+          "verdict": null
+        }
+      }
+    },
+    "ping": {
+      "client_sends": {
+        "type": "ping"
+      },
+      "server_responds": {
+        "type": "pong"
+      }
+    }
+  }
+}
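Consumers of this fixture can sanity-check the `_meta` block before relying on the examples. A minimal sketch, with the fixture content inlined here rather than read from disk (the on-disk path is `tests/fixtures/api_schema_examples.json`):

```python
import json

# Inline sample mirroring the fixture's _meta block.
fixture_text = """
{
  "_meta": {
    "generated_by": "tests/fixtures/generate_api_examples.py",
    "seed": 42,
    "scenario_template": "math_reasoning",
    "difficulty": "easy"
  }
}
"""
meta = json.loads(fixture_text)["_meta"]
print(meta["scenario_template"])  # math_reasoning
```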
tests/fixtures/generate_api_examples.py ADDED
@@ -0,0 +1,330 @@
+#!/usr/bin/env python3
+"""Generate api_schema_examples.json from real Pydantic models.
+
+MOD 10 — run this script to regenerate the fixture whenever the
+contracts change. The output is deterministic.
+
+Usage:
+    python tests/fixtures/generate_api_examples.py
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from replicalab.config import DEFAULT_DIFFICULTY, DEFAULT_SCENARIO_TEMPLATE
+from replicalab.models import (
+    ConversationEntry,
+    EpisodeLog,
+    EpisodeState,
+    LabManagerObservation,
+    Observation,
+    Protocol,
+    RewardBreakdown,
+    ScientistAction,
+    ScientistObservation,
+    StepInfo,
+    StepResult,
+)
+from replicalab.scenarios import available_scenario_families, generate_scenario
+
+OUTPUT_PATH = Path(__file__).parent / "api_schema_examples.json"
+
+# ---------------------------------------------------------------------------
+# Build realistic payloads from real models
+# ---------------------------------------------------------------------------
+
+_SEED = 42
+_TEMPLATE = DEFAULT_SCENARIO_TEMPLATE
+_DIFFICULTY = DEFAULT_DIFFICULTY
+
+# Generate a real scenario to extract observation data
+_pack = generate_scenario(seed=_SEED, template=_TEMPLATE, difficulty=_DIFFICULTY)
+_sci_obs = _pack.scientist_observation
+_lm_obs = _pack.lab_manager_observation
+
+
+def _reset_request():
+    return {
+        "seed": _SEED,
+        "scenario": _TEMPLATE,
+        "difficulty": _DIFFICULTY,
+        "session_id": None,
+    }
+
+
+def _reset_response():
+    obs = Observation(scientist=_sci_obs, lab_manager=_lm_obs)
+    return {
+        "session_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
+        "episode_id": "ep-deadbeef-1234-5678-9abc-def012345678",
+        "observation": obs.model_dump(),
+    }
+
+
+def _propose_action():
+    return ScientistAction(
+        action_type="propose_protocol",
+        sample_size=30,
+        controls=["positive_control", "negative_control"],
+        technique=_sci_obs.paper_method,
+        duration_days=5,
+        required_equipment=list(_lm_obs.equipment_available[:2]) if _lm_obs.equipment_available else ["tool_a"],
+        required_reagents=list(_lm_obs.reagents_in_stock[:2]) if _lm_obs.reagents_in_stock else ["ref_a"],
+        questions=[],
+        rationale="Initial proposal using available resources.",
+    ).model_dump()
+
+
+def _step_request():
+    return {
+        "session_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
+        "action": _propose_action(),
+    }
+
+
+def _mid_episode_step_result():
+    protocol = Protocol(
+        sample_size=30,
+        controls=["positive_control", "negative_control"],
+        technique=_sci_obs.paper_method,
+        duration_days=5,
+        required_equipment=list(_lm_obs.equipment_available[:2]) if _lm_obs.equipment_available else ["tool_a"],
+        required_reagents=list(_lm_obs.reagents_in_stock[:2]) if _lm_obs.reagents_in_stock else ["ref_a"],
+        rationale="Initial proposal using available resources.",
+    )
+
+    history = [
+        ConversationEntry(
+            role="scientist",
+            message="Initial proposal using available resources.",
+            round_number=1,
+            action_type="propose_protocol",
+        ),
+        ConversationEntry(
+            role="lab_manager",
+            message="Budget is within range. Equipment is available.",
+            round_number=1,
+            action_type="report_feasibility",
+        ),
+    ]
+
+    obs = Observation(
+        scientist=ScientistObservation(
+            paper_title=_sci_obs.paper_title,
+            paper_hypothesis=_sci_obs.paper_hypothesis,
+            paper_method=_sci_obs.paper_method,
+            paper_key_finding=_sci_obs.paper_key_finding,
+            experiment_goal=_sci_obs.experiment_goal,
+            conversation_history=history,
+            current_protocol=protocol,
+            round_number=1,
+            max_rounds=_sci_obs.max_rounds,
+        ),
+        lab_manager=LabManagerObservation(
+            budget_total=_lm_obs.budget_total,
+            budget_remaining=_lm_obs.budget_remaining,
+            equipment_available=list(_lm_obs.equipment_available),
+            equipment_booked=list(_lm_obs.equipment_booked),
+            reagents_in_stock=list(_lm_obs.reagents_in_stock),
+            reagents_out_of_stock=list(_lm_obs.reagents_out_of_stock),
+            staff_count=_lm_obs.staff_count,
+            time_limit_days=_lm_obs.time_limit_days,
+            safety_restrictions=list(_lm_obs.safety_restrictions),
+            conversation_history=history,
+            current_protocol=protocol,
+            round_number=1,
+            max_rounds=_lm_obs.max_rounds,
+        ),
+    )
+
+    return StepResult(
+        observation=obs,
+        reward=0.0,
+        done=False,
+        info=StepInfo(
+            agreement_reached=False,
+            error=None,
+            reward_breakdown=None,
+            judge_notes=None,
+            verdict=None,
+        ),
+    ).model_dump()
+
+
+def _terminal_step_result():
+    return StepResult(
+        observation=None,
+        reward=5.0,
+        done=True,
+        info=StepInfo(
+            agreement_reached=True,
+            error=None,
+            reward_breakdown=RewardBreakdown(
+                rigor=0.8,
+                feasibility=0.8,
+                fidelity=0.8,
+                efficiency_bonus=0.2,
+                communication_bonus=0.1,
+                penalties={"timeout": 0.0},
+            ),
+            judge_notes="Stub audit until judge integration lands.",
+            verdict="accept",
+        ),
+    ).model_dump()
+
+
+def _scenarios_response():
+    return {"scenarios": available_scenario_families()}
+
+
+def _replay_response():
+    return EpisodeLog(
+        episode_id="ep-deadbeef-1234-5678-9abc-def012345678",
+        seed=_SEED,
+        scenario_template=_TEMPLATE,
+        difficulty=_DIFFICULTY,
+        final_state=EpisodeState(
+            seed=_SEED,
+            scenario_template=_TEMPLATE,
+            difficulty=_DIFFICULTY,
+            paper_title=_sci_obs.paper_title,
+            paper_hypothesis=_sci_obs.paper_hypothesis,
+            paper_method=_sci_obs.paper_method,
+            paper_key_finding=_sci_obs.paper_key_finding,
+            experiment_goal=_sci_obs.experiment_goal,
+            lab_budget_total=_lm_obs.budget_total,
+            lab_budget_remaining=_lm_obs.budget_remaining,
+            lab_equipment=list(_lm_obs.equipment_available),
+            lab_reagents=list(_lm_obs.reagents_in_stock),
+            lab_staff_count=_lm_obs.staff_count,
+            lab_time_limit_days=_lm_obs.time_limit_days,
+            round_number=3,
+            max_rounds=_sci_obs.max_rounds,
+            done=True,
+            agreement_reached=True,
+            reward=5.0,
+            rigor_score=0.8,
+            feasibility_score=0.8,
+            fidelity_score=0.8,
+        ),
+        transcript=[
+            ConversationEntry(
+                role="scientist",
+                message="Initial proposal using available resources.",
+                round_number=1,
+                action_type="propose_protocol",
+            ),
+            ConversationEntry(
+                role="lab_manager",
+                message="Budget is within range. Equipment is available.",
+                round_number=1,
+                action_type="report_feasibility",
+            ),
+        ],
+        reward_breakdown=RewardBreakdown(
+            rigor=0.8,
+            feasibility=0.8,
+            fidelity=0.8,
+            efficiency_bonus=0.2,
+            communication_bonus=0.1,
+            penalties={"timeout": 0.0},
+        ),
+        total_reward=5.0,
+        rounds_used=3,
+        agreement_reached=True,
+        judge_notes="Stub audit until judge integration lands.",
+        verdict="accept",
+    ).model_dump()
+
+
+def _ws_reset_message():
+    return {
+        "type": "reset",
+        "seed": _SEED,
+        "scenario": _TEMPLATE,
+        "difficulty": _DIFFICULTY,
+    }
+
+
+def _ws_reset_ok_message():
+    obs = Observation(scientist=_sci_obs, lab_manager=_lm_obs)
+    return {
+        "type": "reset_ok",
+        "episode_id": "ep-deadbeef-1234-5678-9abc-def012345678",
+        "observation": obs.model_dump(),
+    }
+
+
+def _ws_step_message():
+    return {
+        "type": "step",
+        "action": _propose_action(),
+    }
+
+
+def _ws_step_ok_message():
+    return {
+        "type": "step_ok",
+        **_mid_episode_step_result(),
+    }
+
+
+# ---------------------------------------------------------------------------
+# Assemble and write
+# ---------------------------------------------------------------------------
+
+
+def main():
+    examples = {
+        "_meta": {
+            "generated_by": "tests/fixtures/generate_api_examples.py",
+            "description": "API schema examples generated from real Pydantic models. Re-run the script to regenerate after contract changes.",
+            "seed": _SEED,
+            "scenario_template": _TEMPLATE,
+            "difficulty": _DIFFICULTY,
+        },
+        "rest": {
+            "POST /reset": {
+                "request": _reset_request(),
+                "response": _reset_response(),
+            },
+            "POST /step": {
+                "request": _step_request(),
+                "response_mid_episode": _mid_episode_step_result(),
+                "response_terminal": _terminal_step_result(),
+            },
+            "GET /scenarios": {
+                "response": _scenarios_response(),
+            },
+            "GET /replay/{episode_id}": {
+                "response": _replay_response(),
+            },
+        },
+        "websocket": {
+            "reset": {
+                "client_sends": _ws_reset_message(),
+                "server_responds": _ws_reset_ok_message(),
+            },
+            "step": {
+                "client_sends": _ws_step_message(),
+                "server_responds": _ws_step_ok_message(),
+            },
+            "ping": {
+                "client_sends": {"type": "ping"},
+                "server_responds": {"type": "pong"},
+            },
+        },
+    }
+
+    OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
+    OUTPUT_PATH.write_text(
+        json.dumps(examples, indent=2, ensure_ascii=False) + "\n",
+        encoding="utf-8",
+    )
+    print(f"Wrote {OUTPUT_PATH}")
+
+
+if __name__ == "__main__":
+    main()
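The script's claim of deterministic output rests on serializing with a fixed indent and insertion-ordered keys, so re-running the generator on unchanged contracts produces no spurious diffs. A minimal stdlib sketch of that property:

```python
import json

# Python dicts preserve insertion order, and json.dumps with a fixed
# indent is a pure function of its input, so identical inputs yield
# byte-identical fixture text.
a = json.dumps({"_meta": {"seed": 42}}, indent=2, ensure_ascii=False) + "\n"
b = json.dumps({"_meta": {"seed": 42}}, indent=2, ensure_ascii=False) + "\n"
print(a == b)  # True
```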
tests/test_client.py ADDED
@@ -0,0 +1,355 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ """Client module tests — TRN 13.
+
+ Tests cover ReplicaLabClient with both REST and WebSocket transports
+ against the real FastAPI test server.
+ """
+
+ from __future__ import annotations
+
+ import threading
+ import time
+
+ import pytest
+ import uvicorn
+
+ from replicalab.client import ReplicaLabClient
+ from replicalab.models import (
+     Observation,
+     ScientistAction,
+     StepResult,
+ )
+
+
+ # ---------------------------------------------------------------------------
+ # Helpers
+ # ---------------------------------------------------------------------------
+
+ def _propose_action(obs: Observation) -> ScientistAction:
+     """Build a valid propose_protocol action from the observation."""
+     from replicalab.scenarios import generate_scenario
+
+     # Tests reset the server with seed=42, so regenerate the same pack locally.
+     pack = generate_scenario(seed=42, template="math_reasoning", difficulty="easy")
+     lab = pack.lab_manager_observation
+     spec = pack.hidden_reference_spec
+     return ScientistAction(
+         action_type="propose_protocol",
+         sample_size=10,
+         controls=["baseline", "ablation"],
+         technique=spec.summary[:60] if spec.summary else "replication_plan",
+         duration_days=max(1, min(2, lab.time_limit_days)),
+         required_equipment=list(lab.equipment_available[:1]) if lab.equipment_available else [],
+         required_reagents=list(lab.reagents_in_stock[:1]) if lab.reagents_in_stock else [],
+         questions=[],
+         rationale=(
+             f"Plan addresses: {', '.join(spec.required_elements[:2])}. "
+             f"Target metric: {spec.target_metric}. "
+             f"Target value: {spec.target_value}. "
+             "Stay within budget and schedule."
+         ),
+     )
+
+
+ def _accept_action() -> ScientistAction:
+     return ScientistAction(
+         action_type="accept",
+         sample_size=0,
+         controls=[],
+         technique="",
+         duration_days=0,
+         required_equipment=[],
+         required_reagents=[],
+         questions=[],
+         rationale="",
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # Shared live-server fixture
+ # ---------------------------------------------------------------------------
+
+ # We spin up a real uvicorn server on a fixed local port for both transports
+ # to keep things realistic and exercise the actual HTTP/WS paths.
+
+ _TEST_PORT = 18765
+
+
+ @pytest.fixture(scope="module")
+ def live_server():
+     """Start a live uvicorn server for the test module."""
+     from server.app import app
+
+     config = uvicorn.Config(app, host="127.0.0.1", port=_TEST_PORT, log_level="error")
+     server = uvicorn.Server(config)
+     thread = threading.Thread(target=server.run, daemon=True)
+     thread.start()
+
+     # Wait until the server is ready
+     import httpx
+
+     for _ in range(50):
+         try:
+             resp = httpx.get(f"http://127.0.0.1:{_TEST_PORT}/health", timeout=1.0)
+             if resp.status_code == 200:
+                 break
+         except Exception:
+             pass
+         time.sleep(0.1)
+     else:
+         pytest.fail("Live server did not start in time")
+
+     yield f"http://127.0.0.1:{_TEST_PORT}"
+
+     server.should_exit = True
+     thread.join(timeout=5)
+
+
+ # ---------------------------------------------------------------------------
+ # REST transport
+ # ---------------------------------------------------------------------------
+
+
+ class TestRestConnect:
+     """connect() over REST verifies server health."""
+
+     def test_connect_succeeds(self, live_server: str) -> None:
+         client = ReplicaLabClient(live_server, transport="rest")
+         client.connect()
+         assert client.connected
+         client.close()
+
+     def test_connect_bad_url_raises(self) -> None:
+         client = ReplicaLabClient("http://127.0.0.1:19999", transport="rest", timeout=1.0)
+         with pytest.raises(Exception):
+             client.connect()
+
+
+ class TestRestReset:
+     """reset() over REST."""
+
+     def test_reset_returns_observation(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="rest") as client:
+             obs = client.reset(seed=42, scenario="math_reasoning", difficulty="easy")
+             assert isinstance(obs, Observation)
+             assert obs.scientist is not None
+             assert obs.scientist.paper_title
+             assert obs.lab_manager is not None
+             assert obs.lab_manager.budget_total > 0
+
+     def test_reset_sets_session_and_episode_id(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="rest") as client:
+             client.reset(seed=1)
+             assert client.session_id is not None
+             assert client.episode_id is not None
+
+     def test_reset_reuses_session(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="rest") as client:
+             client.reset(seed=1)
+             sid1 = client.session_id
+             ep1 = client.episode_id
+             client.reset(seed=2)
+             assert client.session_id == sid1
+             assert client.episode_id != ep1
+
+
+ class TestRestStep:
+     """step() over REST."""
+
+     def test_step_returns_step_result(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="rest") as client:
+             obs = client.reset(seed=42)
+             action = _propose_action(obs)
+             result = client.step(action)
+             assert isinstance(result, StepResult)
+             assert result.done is False
+             assert result.observation is not None
+
+     def test_step_before_reset_raises(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="rest") as client:
+             with pytest.raises(RuntimeError, match="reset"):
+                 client.step(_accept_action())
+
+     def test_full_episode_propose_accept(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="rest") as client:
+             obs = client.reset(seed=42)
+             action = _propose_action(obs)
+             result1 = client.step(action)
+             assert result1.done is False
+
+             result2 = client.step(_accept_action())
+             assert result2.done is True
+             assert result2.reward > 0.0
+             assert result2.info.agreement_reached is True
+             assert result2.info.verdict == "accept"
+             assert result2.info.reward_breakdown is not None
+             assert 0.0 <= result2.info.reward_breakdown.rigor <= 1.0
+
+
+ class TestRestReplay:
+     """replay() over REST."""
+
+     def test_replay_after_episode(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="rest") as client:
+             obs = client.reset(seed=42)
+             action = _propose_action(obs)
+             client.step(action)
+             client.step(_accept_action())
+
+             episode_id = client.episode_id
+             assert episode_id is not None
+             replay = client.replay(episode_id)
+             assert replay.agreement_reached is True
+             assert replay.total_reward > 0.0
+             assert replay.verdict == "accept"
+
+
+ class TestRestContextManager:
+     """Context manager cleans up on exit."""
+
+     def test_context_manager_closes(self, live_server: str) -> None:
+         client = ReplicaLabClient(live_server, transport="rest")
+         with client:
+             assert client.connected
+             client.reset(seed=1)
+         assert not client.connected
+
+
+ # ---------------------------------------------------------------------------
+ # WebSocket transport
+ # ---------------------------------------------------------------------------
+
+
+ class TestWsConnect:
+     """connect() over WebSocket."""
+
+     def test_connect_succeeds(self, live_server: str) -> None:
+         client = ReplicaLabClient(live_server, transport="websocket")
+         client.connect()
+         assert client.connected
+         client.close()
+
+     def test_connect_bad_url_raises(self) -> None:
+         client = ReplicaLabClient("http://127.0.0.1:19999", transport="websocket", timeout=1.0)
+         with pytest.raises(Exception):
+             client.connect()
+
+
+ class TestWsReset:
+     """reset() over WebSocket."""
+
+     def test_reset_returns_observation(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             obs = client.reset(seed=42, scenario="math_reasoning", difficulty="easy")
+             assert isinstance(obs, Observation)
+             assert obs.scientist is not None
+             assert obs.scientist.paper_title
+             assert obs.lab_manager is not None
+             assert obs.lab_manager.budget_total > 0
+
+     def test_reset_sets_episode_id(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             client.reset(seed=42)
+             assert client.episode_id is not None
+
+     def test_ws_session_id_is_none(self, live_server: str) -> None:
+         """WebSocket transport has no explicit session_id."""
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             client.reset(seed=42)
+             assert client.session_id is None
+
+
+ class TestWsStep:
+     """step() over WebSocket."""
+
+     def test_step_returns_step_result(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             obs = client.reset(seed=42)
+             action = _propose_action(obs)
+             result = client.step(action)
+             assert isinstance(result, StepResult)
+             assert result.done is False
+             assert result.observation is not None
+
+     def test_full_episode_propose_accept(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             obs = client.reset(seed=42)
+             action = _propose_action(obs)
+             result1 = client.step(action)
+             assert result1.done is False
+
+             result2 = client.step(_accept_action())
+             assert result2.done is True
+             assert result2.reward > 0.0
+             assert result2.info.agreement_reached is True
+             assert result2.info.verdict == "accept"
+             assert result2.info.reward_breakdown is not None
+             assert 0.0 <= result2.info.reward_breakdown.rigor <= 1.0
+
+     def test_semantic_invalid_action_step_ok_with_error(self, live_server: str) -> None:
+         """Semantically invalid action → step result with info.error, not a crash."""
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             client.reset(seed=42)
+             bad_action = ScientistAction(
+                 action_type="propose_protocol",
+                 sample_size=5,
+                 controls=["baseline"],
+                 technique="some technique",
+                 duration_days=999,
+                 required_equipment=[],
+                 required_reagents=[],
+                 questions=[],
+                 rationale="Duration is impossibly long.",
+             )
+             result = client.step(bad_action)
+             assert result.done is False
+             assert result.info.error is not None
+             assert "Validation errors" in result.info.error
+
+
+ class TestWsContextManager:
+     """Context manager cleans up on exit."""
+
+     def test_context_manager_closes(self, live_server: str) -> None:
+         client = ReplicaLabClient(live_server, transport="websocket")
+         with client:
+             assert client.connected
+             client.reset(seed=1)
+         assert not client.connected
+
+
+ class TestWsUnsupported:
+     """state() and replay() raise NotImplementedError on WS transport."""
+
+     def test_state_not_supported(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             client.reset(seed=42)
+             with pytest.raises(NotImplementedError):
+                 client.state()
+
+     def test_replay_not_supported(self, live_server: str) -> None:
+         with ReplicaLabClient(live_server, transport="websocket") as client:
+             with pytest.raises(NotImplementedError):
+                 client.replay("some-id")
+
+
+ # ---------------------------------------------------------------------------
+ # Constructor validation
+ # ---------------------------------------------------------------------------
+
+
+ class TestConstructor:
+     """Transport selection and validation."""
+
+     def test_unknown_transport_raises(self) -> None:
+         with pytest.raises(ValueError, match="Unknown transport"):
+             ReplicaLabClient(transport="grpc")
+
+     def test_not_connected_raises_on_reset(self) -> None:
+         client = ReplicaLabClient(transport="rest")
+         with pytest.raises(RuntimeError, match="not connected"):
+             client.reset(seed=1)
+
+     def test_default_transport_is_websocket(self) -> None:
+         client = ReplicaLabClient()
+         # Check the internal transport type
+         assert type(client._transport).__name__ == "_WsTransport"
tests/test_env.py ADDED
@@ -0,0 +1,635 @@
+ """Tests for ENV 01–08 and JDG 04–05.
+
+ TST 01: reset returns valid observations
+ TST 02: valid step advances round, terminal path returns correct shape
+ TST 03: invalid action returns structured error, env survives
+ """
+
+ from __future__ import annotations
+
+ import pytest
+
+ from replicalab.env import ReplicaLabEnv
+ from replicalab.models import (
+     Protocol,
+     RewardBreakdown,
+     ScientistAction,
+ )
+ from replicalab.scenarios import generate_scenario
+ from replicalab.scoring.rubric import build_reward_breakdown, compute_total_reward
+
+
+ # ---------------------------------------------------------------------------
+ # Helpers
+ # ---------------------------------------------------------------------------
+
+
+ def _scenario(template: str = "math_reasoning", difficulty: str = "easy"):
+     return generate_scenario(seed=42, template=template, difficulty=difficulty)
+
+
+ def _good_action(scenario) -> ScientistAction:
+     """Build a valid propose_protocol action that fits the scenario."""
+     lab = scenario.lab_manager_observation
+     spec = scenario.hidden_reference_spec
+     return ScientistAction(
+         action_type="propose_protocol",
+         sample_size=10,
+         controls=["baseline", "ablation"],
+         technique=spec.summary[:60] if spec.summary else "replication_plan",
+         duration_days=max(1, min(2, lab.time_limit_days)),
+         required_equipment=(
+             list(lab.equipment_available[:1]) if lab.equipment_available else []
+         ),
+         required_reagents=(
+             list(lab.reagents_in_stock[:1]) if lab.reagents_in_stock else []
+         ),
+         questions=[],
+         rationale=(
+             f"Plan addresses: {', '.join(spec.required_elements[:2])}. "
+             f"Target metric: {spec.target_metric}. "
+             f"Target value: {spec.target_value}. "
+             "Stay within budget and schedule."
+         ),
+     )
+
+
+ def _accept_action() -> ScientistAction:
+     """Build a valid accept action."""
+     return ScientistAction(
+         action_type="accept",
+         sample_size=0,
+         controls=[],
+         technique="",
+         duration_days=0,
+         required_equipment=[],
+         required_reagents=[],
+         questions=[],
+         rationale="",
+     )
+
+
+ def _request_info_action() -> ScientistAction:
+     return ScientistAction(
+         action_type="request_info",
+         sample_size=0,
+         controls=[],
+         technique="",
+         duration_days=0,
+         required_equipment=[],
+         required_reagents=[],
+         questions=["What equipment is available?"],
+         rationale="",
+     )
+
+
+ def _good_protocol(scenario) -> Protocol:
+     """Build a well-formed protocol aligned to the scenario."""
+     lab = scenario.lab_manager_observation
+     spec = scenario.hidden_reference_spec
+     return Protocol(
+         sample_size=10,
+         controls=["baseline", "ablation"],
+         technique=spec.summary[:60] if spec.summary else "replication_plan",
+         duration_days=max(1, min(2, lab.time_limit_days)),
+         required_equipment=(
+             list(lab.equipment_available[:1]) if lab.equipment_available else []
+         ),
+         required_reagents=(
+             list(lab.reagents_in_stock[:1]) if lab.reagents_in_stock else []
+         ),
+         rationale=(
+             f"Plan addresses: {', '.join(spec.required_elements[:2])}. "
+             f"Target metric: {spec.target_metric}. "
+             f"Target value: {spec.target_value}. "
+             "Stay within budget and schedule."
+         ),
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # TST 01 — reset returns valid observations
+ # ---------------------------------------------------------------------------
+
+
+ class TestReset:
+     """TST 01: reset() returns a well-formed Observation."""
+
+     def test_reset_returns_observation_with_both_roles(self) -> None:
+         env = ReplicaLabEnv()
+         obs = env.reset(seed=42, scenario="math_reasoning", difficulty="easy")
+
+         assert obs.scientist is not None
+         assert obs.lab_manager is not None
+
+     def test_reset_scientist_fields_populated(self) -> None:
+         env = ReplicaLabEnv()
+         obs = env.reset(seed=42, scenario="ml_benchmark", difficulty="easy")
+
+         s = obs.scientist
+         assert s.paper_title
+         assert s.paper_hypothesis
+         assert s.experiment_goal
+         assert s.round_number == 0
+         assert s.max_rounds > 0
+         assert s.current_protocol is None
+         assert s.conversation_history == []
+
+     def test_reset_lab_manager_fields_populated(self) -> None:
+         env = ReplicaLabEnv()
+         obs = env.reset(seed=42, scenario="finance_trading", difficulty="easy")
+
+         lm = obs.lab_manager
+         assert lm.budget_total > 0
+         assert lm.budget_remaining > 0
+         assert lm.staff_count > 0
+         assert lm.time_limit_days > 0
+         assert lm.round_number == 0
+
+     def test_reset_preserves_booked_and_out_of_stock(self) -> None:
+         """ENV 02: booked/out-of-stock data comes from the scenario pack,
+         not hardcoded empty lists."""
+         env = ReplicaLabEnv()
+         # hard difficulty is more likely to have unavailable resources
+         obs = env.reset(seed=42, scenario="ml_benchmark", difficulty="hard")
+         lm = obs.lab_manager
+
+         # The observation should carry scenario data (it may or may not have
+         # booked items depending on the scenario, but the lists should exist)
+         assert isinstance(lm.equipment_booked, list)
+         assert isinstance(lm.reagents_out_of_stock, list)
+         assert isinstance(lm.safety_restrictions, list)
+         assert len(lm.safety_restrictions) > 0  # always has at least one
+
+     def test_reset_state_round_zero(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=1)
+
+         s = env.state()
+         assert s.round_number == 0
+         assert s.done is False
+         assert s.agreement_reached is False
+
+     def test_reset_generates_episode_id(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=1)
+
+         eid = env.episode_id()
+         assert eid
+         assert len(eid) > 10  # UUID
+
+     def test_reset_clears_previous_episode(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=1, scenario="math_reasoning")
+         first_id = env.episode_id()
+
+         env.reset(seed=2, scenario="ml_benchmark")
+         second_id = env.episode_id()
+
+         assert first_id != second_id
+         assert env.state().round_number == 0
+
+     def test_reset_all_templates_and_difficulties(self) -> None:
+         env = ReplicaLabEnv()
+         for template in ("math_reasoning", "ml_benchmark", "finance_trading"):
+             for difficulty in ("easy", "medium", "hard"):
+                 obs = env.reset(seed=7, scenario=template, difficulty=difficulty)
+                 assert obs.scientist is not None
+                 assert obs.lab_manager is not None
+
+
+ # ---------------------------------------------------------------------------
+ # TST 03 — invalid action returns structured error, env survives
+ # ---------------------------------------------------------------------------
+
+
+ class TestInvalidAction:
+     """TST 03: env returns a structured error for invalid proposals."""
+
+     def test_invalid_duration_returns_error_string(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=42, scenario="math_reasoning", difficulty="easy")
+
+         # duration exceeds the time limit
+         bad_action = ScientistAction(
+             action_type="propose_protocol",
+             sample_size=5,
+             controls=["baseline"],
+             technique="some technique",
+             duration_days=999,
+             required_equipment=[],
+             required_reagents=[],
+             questions=[],
+             rationale="This has way too long a duration for the lab.",
+         )
+         result = env.step(bad_action)
+
+         assert result.done is False
+         assert result.info.error is not None
+         assert "Validation errors" in result.info.error
+
+     def test_env_survives_after_invalid_action(self) -> None:
+         """After returning an error, the env still accepts valid actions."""
+         env = ReplicaLabEnv()
+         scenario = _scenario("math_reasoning", "easy")
+         env.reset(seed=42, scenario="math_reasoning", difficulty="easy")
+
+         # Send an invalid action
+         bad_action = ScientistAction(
+             action_type="propose_protocol",
+             sample_size=5,
+             controls=["baseline"],
+             technique="some technique",
+             duration_days=999,
+             required_equipment=[],
+             required_reagents=[],
+             questions=[],
+             rationale="Way too long a duration for the lab to handle.",
+         )
+         error_result = env.step(bad_action)
+         assert error_result.info.error is not None
+
+         # Now send a valid action — the env should still work
+         good = _good_action(scenario)
+         result = env.step(good)
+         assert result.info.error is None
+         assert result.done is False
+
+     def test_invalid_action_does_not_advance_round(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=42, scenario="math_reasoning", difficulty="easy")
+
+         bad_action = ScientistAction(
+             action_type="propose_protocol",
+             sample_size=5,
+             controls=["baseline"],
+             technique="some technique",
+             duration_days=999,
+             required_equipment=[],
+             required_reagents=[],
+             questions=[],
+             rationale="Duration is impossibly long for this scenario.",
+         )
+         result = env.step(bad_action)
+
+         assert result.info.error is not None
+         assert env.state().round_number == 0
+
+     def test_request_info_always_passes_validation(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=42)
+         result = env.step(_request_info_action())
+
+         assert result.info.error is None
+         assert result.done is False
+
+
+ # ---------------------------------------------------------------------------
+ # TST 02 — valid step advances round, terminal path
+ # ---------------------------------------------------------------------------
+
+
+ class TestStep:
+     """TST 02: step() advances rounds and the terminal path returns the correct shape."""
+
+     def test_step_advances_round_number(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         action = _good_action(scenario)
+         result = env.step(action)
+
+         assert env.state().round_number == 1
+         assert result.done is False
+         assert result.reward == 0.0
+
+     def test_step_returns_observations(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         result = env.step(_good_action(scenario))
+
+         assert result.observation is not None
+         assert result.observation.scientist is not None
+         assert result.observation.lab_manager is not None
+         assert result.observation.scientist.round_number == 1
+
+     def test_step_records_conversation_history(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         env.step(_good_action(scenario))
+
+         s = env.state()
+         # Should have 2 entries: scientist + lab manager
+         assert len(s.conversation_history) == 2
+         assert s.conversation_history[0].role == "scientist"
+         assert s.conversation_history[1].role == "lab_manager"
+
+     def test_accept_with_protocol_terminates(self) -> None:
+         """Scientist accept with an existing protocol → done."""
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         # First propose a protocol
+         env.step(_good_action(scenario))
+
+         # Then accept
+         result = env.step(_accept_action())
+
+         assert result.done is True
+         assert result.info.agreement_reached is True
+
+     def test_accept_terminal_step_has_real_reward(self) -> None:
+         """ENV 06: terminal accept computes real judge scores, not the stub 0.8."""
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         env.step(_good_action(scenario))
+         result = env.step(_accept_action())
+
+         assert result.done is True
+         assert result.reward > 0.0
+         assert result.info.reward_breakdown is not None
+
+         rb = result.info.reward_breakdown
+         assert 0.0 <= rb.rigor <= 1.0
+         assert 0.0 <= rb.feasibility <= 1.0
+         assert 0.0 <= rb.fidelity <= 1.0
+         # Verify it's not the old stub 0.8
+         assert not (rb.rigor == 0.8 and rb.feasibility == 0.8 and rb.fidelity == 0.8)
+
+     def test_max_rounds_terminates(self) -> None:
+         """Reaching max_rounds terminates without agreement."""
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         max_r = env.state().max_rounds
+         for _ in range(max_r):
+             result = env.step(_good_action(scenario))
+
+         assert result.done is True
+         assert result.info.agreement_reached is False
+         assert result.reward == 0.0
+
+     def test_step_info_has_round_and_episode_id(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         result = env.step(_good_action(scenario))
+
+         assert result.info.round == 1
+         assert result.info.episode_id == env.episode_id()
+
+     def test_full_episode_propose_then_accept(self) -> None:
+         """Full 2-step episode: propose → accept."""
+         env = ReplicaLabEnv()
+         scenario = _scenario("ml_benchmark", "easy")
+         env.reset(seed=42, scenario="ml_benchmark", difficulty="easy")
+
+         r1 = env.step(_good_action(scenario))
+         assert not r1.done
+
+         r2 = env.step(_accept_action())
+         assert r2.done
+         assert r2.info.agreement_reached
+         assert r2.reward > 0
+
+
+ # ---------------------------------------------------------------------------
+ # ENV 07 — state() returns deep snapshot
+ # ---------------------------------------------------------------------------
+
+
+ class TestStateSnapshot:
+     """ENV 07: state() returns a deep copy, not a reference."""
+
+     def test_state_is_deep_copy(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=42)
+
+         s1 = env.state()
+         s1.round_number = 999  # mutate the snapshot
+
+         s2 = env.state()
+         assert s2.round_number == 0  # env state unaffected
+
+     def test_state_history_is_independent(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+         env.step(_good_action(scenario))
+
+         s1 = env.state()
+         original_len = len(s1.conversation_history)
+         s1.conversation_history.clear()
+
+         s2 = env.state()
+         assert len(s2.conversation_history) == original_len
+
+
+ # ---------------------------------------------------------------------------
+ # ENV 08 — close() and _ensure_open()
+ # ---------------------------------------------------------------------------
+
+
+ class TestCloseReopen:
+     """ENV 08: close/reopen lifecycle."""
+
+     def test_close_is_idempotent(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=1)
+         env.close()
+         env.close()  # should not raise
+
+     def test_step_after_close_raises(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=1)
+         env.close()
+
+         with pytest.raises(RuntimeError, match="closed"):
+             env.step(_good_action(scenario))
+
+     def test_reset_reopens_closed_env(self) -> None:
+         env = ReplicaLabEnv()
+         env.reset(seed=1)
+         env.close()
+
+         # reset should reopen
+         obs = env.reset(seed=2)
+         assert obs.scientist is not None
+
+         # step should work again
+         scenario = _scenario()
+         result = env.step(_good_action(scenario))
+         assert result.info.error is None
+
+
+ # ---------------------------------------------------------------------------
+ # JDG 04–05 — rubric unit tests
+ # ---------------------------------------------------------------------------
+
+
+ class TestRubric:
+     """JDG 04–05: compute_total_reward and build_reward_breakdown."""
+
+     def test_compute_total_reward_formula(self) -> None:
+         """10 × rigor × feasibility × fidelity + bonuses − penalties."""
+         rb = RewardBreakdown(
+             rigor=1.0,
+             feasibility=1.0,
+             fidelity=1.0,
+             efficiency_bonus=0.5,
+             communication_bonus=0.0,
+             penalties={},
+         )
+         total = compute_total_reward(rb)
+         assert total == 10.5  # 10*1*1*1 + 0.5
+
+     def test_compute_total_reward_with_penalties(self) -> None:
+         rb = RewardBreakdown(
+             rigor=0.8,
+             feasibility=0.9,
+             fidelity=0.7,
+             efficiency_bonus=0.0,
+             communication_bonus=0.0,
+             penalties={"timeout": 1.0, "invalid": 0.5},
+         )
+         expected = 10 * 0.8 * 0.9 * 0.7 - 1.5  # 5.04 - 1.5 = 3.54
+         assert abs(compute_total_reward(rb) - expected) < 0.001
+
+     def test_compute_total_reward_zero_scores(self) -> None:
+         rb = RewardBreakdown(rigor=0.0, feasibility=0.5, fidelity=0.5)
+         assert compute_total_reward(rb) == 0.0
+
+     def test_build_reward_breakdown_returns_valid_scores(self) -> None:
+         scenario = _scenario("ml_benchmark", "easy")
+         protocol = _good_protocol(scenario)
+
+         breakdown = build_reward_breakdown(
+             protocol=protocol,
+             scenario=scenario,
+             rounds_used=1,
+             max_rounds=6,
+         )
+
+         assert 0.0 <= breakdown.rigor <= 1.0
+         assert 0.0 <= breakdown.feasibility <= 1.0
+         assert 0.0 <= breakdown.fidelity <= 1.0
+         assert breakdown.efficiency_bonus >= 0.0
+
+     def test_build_reward_breakdown_efficiency_bonus(self) -> None:
+         """Finishing in fewer rounds gives a higher bonus."""
+         scenario = _scenario()
+         protocol = _good_protocol(scenario)
+
+         fast = build_reward_breakdown(protocol, scenario, rounds_used=1, max_rounds=6)
+         slow = build_reward_breakdown(protocol, scenario, rounds_used=5, max_rounds=6)
+
+         assert fast.efficiency_bonus > slow.efficiency_bonus
+
+     def test_build_reward_breakdown_is_deterministic(self) -> None:
+         scenario = _scenario("finance_trading", "medium")
+         protocol = _good_protocol(scenario)
+
+         b1 = build_reward_breakdown(protocol, scenario, rounds_used=2, max_rounds=6)
+         b2 = build_reward_breakdown(protocol, scenario, rounds_used=2, max_rounds=6)
+
+         assert b1.rigor == b2.rigor
+         assert b1.feasibility == b2.feasibility
+         assert b1.fidelity == b2.fidelity
+         assert b1.efficiency_bonus == b2.efficiency_bonus
+
+     def test_total_reward_matches_manual_calculation(self) -> None:
+         scenario = _scenario("math_reasoning", "easy")
+         protocol = _good_protocol(scenario)
+
+         breakdown = build_reward_breakdown(protocol, scenario, rounds_used=2, max_rounds=6)
+         total = compute_total_reward(breakdown)
+         expected = (
+             10.0 * breakdown.rigor * breakdown.feasibility * breakdown.fidelity
+             + breakdown.efficiency_bonus
+             + breakdown.communication_bonus
+             - sum(breakdown.penalties.values())
+         )
+         assert abs(total - expected) < 0.0001
+
+
+ # ---------------------------------------------------------------------------
+ # ENV 06 — terminal reward wiring
+ # ---------------------------------------------------------------------------
+
+
+ class TestEnvReward:
+     """ENV 06: real judge scoring at terminal steps."""
+
+     def test_agreement_terminal_has_breakdown_notes_verdict(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         env.step(_good_action(scenario))
+         result = env.step(_accept_action())
+
+         assert result.done
+         assert result.info.reward_breakdown is not None
+         assert result.info.judge_notes is not None
+         assert result.info.verdict == "accept"
+         assert "rigor" in result.info.judge_notes
+
+     def test_no_agreement_terminal_is_deterministic(self) -> None:
+         def run_timeout_episode():
+             env = ReplicaLabEnv()
+             scenario = _scenario()
+             env.reset(seed=42)
+             max_r = env.state().max_rounds
+             result = None
+             for _ in range(max_r):
+                 result = env.step(_good_action(scenario))
+             return result
+
+         r1 = run_timeout_episode()
+         r2 = run_timeout_episode()
+
+         assert r1.reward == r2.reward
+         assert r1.info.verdict == r2.info.verdict
+
+     def test_timeout_verdict(self) -> None:
+         env = ReplicaLabEnv()
+         scenario = _scenario()
+         env.reset(seed=42)
+
+         max_r = env.state().max_rounds
+         result = None
+         for _ in range(max_r):
+             result = env.step(_good_action(scenario))
+
+         assert result.done
+         assert result.info.verdict == "timeout"
+         assert result.info.reward_breakdown is not None
619
+ assert result.reward == 0.0
620
+
621
+ def test_episode_state_stores_final_scores(self) -> None:
622
+ env = ReplicaLabEnv()
623
+ scenario = _scenario()
624
+ env.reset(seed=42)
625
+
626
+ env.step(_good_action(scenario))
627
+ env.step(_accept_action())
628
+
629
+ s = env.state()
630
+ assert s.done
631
+ assert s.agreement_reached
632
+ assert s.rigor_score > 0.0
633
+ assert s.feasibility_score > 0.0
634
+ assert s.fidelity_score > 0.0
635
+ assert s.reward > 0.0
tests/test_reward.py CHANGED
@@ -3,9 +3,15 @@
 from __future__ import annotations
 
 from replicalab.agents.lab_manager_policy import check_feasibility
-from replicalab.models import Protocol
+from replicalab.models import Protocol, RewardBreakdown
 from replicalab.scenarios import generate_scenario
-from replicalab.scoring import score_feasibility, score_fidelity, score_rigor
+from replicalab.scoring import (
+    build_reward_breakdown,
+    compute_total_reward,
+    score_feasibility,
+    score_fidelity,
+    score_rigor,
+)
 
 
 # ---------------------------------------------------------------------------
@@ -301,3 +307,100 @@ def test_good_protocol_dominates_bad_on_rigor_and_fidelity() -> None:
 
     assert score_rigor(good, scenario) > score_rigor(bad, scenario)
     assert score_fidelity(good, scenario) > score_fidelity(bad, scenario)
+
+
+# ---------------------------------------------------------------------------
+# JDG 04 — compute_total_reward
+# ---------------------------------------------------------------------------
+
+
+def test_total_reward_perfect_beats_broken() -> None:
+    """A well-aligned protocol earns a higher total reward than a bad one."""
+    scenario = _scenario("ml_benchmark", "easy")
+    good = _good_protocol(scenario)
+    bad = _bad_protocol()
+
+    good_bd = build_reward_breakdown(good, scenario, rounds_used=1, max_rounds=6)
+    bad_bd = build_reward_breakdown(bad, scenario, rounds_used=1, max_rounds=6)
+
+    assert compute_total_reward(good_bd) > compute_total_reward(bad_bd)
+
+
+def test_zero_feasibility_zeroes_base() -> None:
+    """If any component is 0, the multiplicative base is 0."""
+    rb = RewardBreakdown(rigor=1.0, feasibility=0.0, fidelity=1.0)
+    assert compute_total_reward(rb) == 0.0
+
+
+def test_efficiency_bonus_higher_when_faster() -> None:
+    """Finishing in fewer rounds yields a higher total reward."""
+    scenario = _scenario()
+    protocol = _good_protocol(scenario)
+
+    fast = build_reward_breakdown(protocol, scenario, rounds_used=1, max_rounds=6)
+    slow = build_reward_breakdown(protocol, scenario, rounds_used=5, max_rounds=6)
+
+    assert compute_total_reward(fast) > compute_total_reward(slow)
+
+
+def test_penalty_subtraction_exact() -> None:
+    """Named penalties subtract exactly from the total."""
+    rb = RewardBreakdown(
+        rigor=1.0,
+        feasibility=1.0,
+        fidelity=1.0,
+        penalties={"invalid_tool_use": 2.0, "unsupported_claim": 0.5},
+    )
+    total = compute_total_reward(rb)
+    assert total == 7.5  # 10*1*1*1 - 2.5
+
+
+def test_total_reward_clamps_at_zero() -> None:
+    """Massive penalties cannot push the total below 0."""
+    rb = RewardBreakdown(
+        rigor=0.1,
+        feasibility=0.1,
+        fidelity=0.1,
+        penalties={"massive_penalty": 50.0},
+    )
+    assert compute_total_reward(rb) == 0.0
+
+
+def test_breakdown_determinism() -> None:
+    """Same inputs always produce the same total reward."""
+    scenario = _scenario("finance_trading", "medium")
+    protocol = _good_protocol(scenario)
+
+    b1 = build_reward_breakdown(protocol, scenario, rounds_used=3, max_rounds=6)
+    b2 = build_reward_breakdown(protocol, scenario, rounds_used=3, max_rounds=6)
+
+    assert compute_total_reward(b1) == compute_total_reward(b2)
+
+
+# ---------------------------------------------------------------------------
+# JDG 05 — build_reward_breakdown
+# ---------------------------------------------------------------------------
+
+
+def test_breakdown_accepts_external_penalties() -> None:
+    """Callers can inject named penalty keys via the penalties parameter."""
+    scenario = _scenario()
+    protocol = _good_protocol(scenario)
+
+    bd = build_reward_breakdown(
+        protocol, scenario, rounds_used=2, max_rounds=6,
+        penalties={"invalid_tool_use": 1.0},
+    )
+
+    assert "invalid_tool_use" in bd.penalties
+    assert bd.penalties["invalid_tool_use"] == 1.0
+
+
+def test_breakdown_no_penalties_by_default() -> None:
+    """Without external penalties, the dict is empty."""
+    scenario = _scenario()
+    protocol = _good_protocol(scenario)
+
+    bd = build_reward_breakdown(protocol, scenario, rounds_used=2, max_rounds=6)
+
+    assert bd.penalties == {}
tests/test_scientist_policy.py CHANGED
@@ -556,3 +556,347 @@ def test_baseline_scientist_finishes_stub_episode_without_crashing() -> None:
 
     assert second_step.done is True
     assert second_step.info.agreement_reached is True
+
+
+# ---------------------------------------------------------------------------
+# AGT 08 — Extended prompt, parser, formatter, and baseline coverage
+# ---------------------------------------------------------------------------
+
+
+# --- Parser happy paths ---
+
+
+def test_parse_scientist_output_accepts_propose_protocol() -> None:
+    raw_text = """{
+        "action_type": "propose_protocol",
+        "sample_size": 48,
+        "controls": ["vehicle_control", "positive_control"],
+        "technique": "wst1_assay",
+        "duration_days": 5,
+        "required_equipment": ["plate_reader"],
+        "required_reagents": ["wst1", "dmso"],
+        "questions": [],
+        "rationale": "Standard viability assay with two controls."
+    }"""
+
+    action = parse_scientist_output(raw_text)
+
+    assert action.action_type is ScientistActionType.PROPOSE_PROTOCOL
+    assert action.sample_size == 48
+    assert action.technique == "wst1_assay"
+    assert action.controls == ["vehicle_control", "positive_control"]
+    assert action.questions == []
+
+
+def test_parse_scientist_output_accepts_accept_action() -> None:
+    raw_text = """{
+        "action_type": "accept",
+        "sample_size": 0,
+        "controls": [],
+        "technique": "",
+        "duration_days": 0,
+        "required_equipment": [],
+        "required_reagents": [],
+        "questions": [],
+        "rationale": ""
+    }"""
+
+    action = parse_scientist_output(raw_text)
+
+    assert action.action_type is ScientistActionType.ACCEPT
+    assert action.sample_size == 0
+    assert action.rationale == ""
+
+
+def test_parse_scientist_output_accepts_prose_wrapped_json() -> None:
+    raw_text = (
+        "After reviewing the constraints I think a request is in order.\n\n"
+        '{"action_type": "request_info", "sample_size": 0, '
+        '"controls": [], "technique": "", "duration_days": 0, '
+        '"required_equipment": [], "required_reagents": [], '
+        '"questions": ["Is the GPU available?"], "rationale": ""}\n\n'
+        "That should clarify the compute situation."
+    )
+
+    action = parse_scientist_output(raw_text)
+
+    assert action.action_type is ScientistActionType.REQUEST_INFO
+    assert action.questions == ["Is the GPU available?"]
+
+
+# --- Parser edge cases ---
+
+
+def test_parse_scientist_output_raises_on_empty_string() -> None:
+    with pytest.raises(ScientistOutputParseError) as exc_info:
+        parse_scientist_output("")
+
+    assert exc_info.value.code == "no_json"
+
+
+def test_parse_scientist_output_raises_on_whitespace_only() -> None:
+    with pytest.raises(ScientistOutputParseError) as exc_info:
+        parse_scientist_output(" \n\t ")
+
+    assert exc_info.value.code == "no_json"
+
+
+def test_parse_scientist_output_raises_on_json_list() -> None:
+    # The parser's brace extractor finds the inner object from the list,
+    # so this surfaces as an invalid_action (missing required fields)
+    # rather than an invalid_json error.
+    with pytest.raises(ScientistOutputParseError) as exc_info:
+        parse_scientist_output('[{"action_type": "accept"}]')
+
+    assert exc_info.value.code == "invalid_action"
+
+
+def test_parse_scientist_output_raises_on_extra_forbidden_keys() -> None:
+    raw_text = """{
+        "action_type": "accept",
+        "sample_size": 0,
+        "controls": [],
+        "technique": "",
+        "duration_days": 0,
+        "required_equipment": [],
+        "required_reagents": [],
+        "questions": [],
+        "rationale": "",
+        "secret_field": "should not be here"
+    }"""
+
+    with pytest.raises(ScientistOutputParseError) as exc_info:
+        parse_scientist_output(raw_text)
+
+    assert exc_info.value.code == "invalid_action"
+    assert exc_info.value.parsed_payload is not None
+    assert "secret_field" in exc_info.value.parsed_payload
+
+
+def test_parse_error_to_dict_serialization() -> None:
+    try:
+        parse_scientist_output("no json here")
+    except ScientistOutputParseError as exc:
+        result = exc.to_dict()
+        assert result["code"] == "no_json"
+        assert result["raw_text"] == "no json here"
+        assert result["parsed_payload"] is None
+        assert "message" in result
+    else:
+        pytest.fail("Expected ScientistOutputParseError")
+
+
+def test_parse_error_to_dict_with_parsed_payload() -> None:
+    raw_text = """{
+        "action_type": "request_info",
+        "sample_size": 0,
+        "controls": [],
+        "technique": "",
+        "duration_days": 0,
+        "required_equipment": [],
+        "required_reagents": [],
+        "questions": [],
+        "rationale": ""
+    }"""
+    try:
+        parse_scientist_output(raw_text)
+    except ScientistOutputParseError as exc:
+        result = exc.to_dict()
+        assert result["code"] == "invalid_action"
+        assert result["parsed_payload"] is not None
+        assert result["parsed_payload"]["action_type"] == "request_info"
+    else:
+        pytest.fail("Expected ScientistOutputParseError")
+
+
+# --- System prompt: domain coverage ---
+
+
+def test_system_prompt_math_domain() -> None:
+    scenario = generate_scenario(seed=10, template="math_reasoning", difficulty="easy")
+    prompt = build_scientist_system_prompt(scenario)
+
+    assert "Domain: mathematics" in prompt
+    assert scenario.task_summary in prompt
+    assert "You are the Scientist agent" in prompt
+
+
+def test_system_prompt_finance_domain() -> None:
+    scenario = generate_scenario(seed=10, template="finance_trading", difficulty="easy")
+    prompt = build_scientist_system_prompt(scenario)
+
+    assert "Domain: finance_trading" in prompt
+    assert scenario.task_summary in prompt
+
+
+def test_system_prompt_ml_domain() -> None:
+    scenario = generate_scenario(seed=10, template="ml_benchmark", difficulty="easy")
+    prompt = build_scientist_system_prompt(scenario)
+
+    assert "Domain: machine_learning" in prompt
+    assert scenario.task_summary in prompt
+
+
+def test_system_prompt_accepts_dict_input() -> None:
+    scenario = generate_scenario(seed=5, template="math_reasoning", difficulty="easy")
+    pack_dict = scenario.model_dump()
+
+    prompt = build_scientist_system_prompt(pack_dict)
+
+    assert "You are the Scientist agent" in prompt
+    assert scenario.task_summary in prompt
+    assert "Domain: mathematics" in prompt
+
+
+# --- System prompt: bounded-tool policy assertions ---
+
+
+def test_system_prompt_contains_bounded_tool_policy() -> None:
+    scenario = generate_scenario(seed=1, template="math_reasoning", difficulty="easy")
+    prompt = build_scientist_system_prompt(scenario)
+
+    assert "search_evidence" in prompt
+    assert "run_code_check" in prompt
+    assert "inspect_image" in prompt
+
+
+def test_system_prompt_bounded_tool_policy_rules() -> None:
+    scenario = generate_scenario(seed=1, template="ml_benchmark", difficulty="medium")
+    prompt = build_scientist_system_prompt(scenario)
+
+    assert "No unrestricted web browsing" in prompt
+    assert "No audio" in prompt
+    assert "do not override constraints" in prompt or "Tools do not override constraints" in prompt
+
+
+def test_system_prompt_bounded_tool_policy_present_in_all_domains() -> None:
+    for template in ("math_reasoning", "ml_benchmark", "finance_trading"):
+        scenario = generate_scenario(seed=42, template=template, difficulty="easy")
+        prompt = build_scientist_system_prompt(scenario)
+
+        assert "Bounded tool policy" in prompt, f"Missing in {template}"
+        assert "search_evidence" in prompt, f"Missing search_evidence in {template}"
+        assert "run_code_check" in prompt, f"Missing run_code_check in {template}"
+        assert "inspect_image" in prompt, f"Missing inspect_image in {template}"
+
+
+# --- System prompt: role-boundary assertions ---
+
+
+def test_system_prompt_contains_role_boundaries() -> None:
+    scenario = generate_scenario(seed=1, template="math_reasoning", difficulty="easy")
+    prompt = build_scientist_system_prompt(scenario)
+
+    assert "do not invent resources" in prompt
+    assert "do not assume access to hidden ground truth" in prompt.lower() or \
+        "hidden ground truth" in prompt
+
+
+def test_system_prompt_contains_output_contract() -> None:
+    scenario = generate_scenario(seed=1, template="math_reasoning", difficulty="easy")
+    prompt = build_scientist_system_prompt(scenario)
+
+    assert "Output contract" in prompt
+    assert "exactly one JSON object" in prompt
+    assert "no extra keys" in prompt
+
+
+# --- Observation formatter edge cases ---
+
+
+def test_format_observation_final_round() -> None:
+    obs = _base_observation(round_number=5, max_rounds=6)
+    result = format_scientist_observation(obs)
+
+    assert "Round 5 of 6" in result
+    assert "Respond with exactly one JSON" in result
+
+
+def test_format_observation_protocol_with_empty_lists() -> None:
+    protocol = Protocol(
+        sample_size=1,
+        controls=[],
+        technique="minimal_check",
+        duration_days=1,
+        required_equipment=[],
+        required_reagents=[],
+        rationale="Minimal protocol.",
+    )
+    obs = _base_observation(current_protocol=protocol, round_number=1)
+    result = format_scientist_observation(obs)
+
+    assert "Current protocol:" in result
+    assert "technique: minimal_check" in result
+    assert "controls: (none)" in result
+    assert "required_equipment: (none)" in result
+    assert "required_reagents: (none)" in result
+
+
+# --- Baseline: domain inference ---
+
+
+def test_baseline_scientist_infers_ml_domain() -> None:
+    obs = _base_observation(
+        paper_title="Reproducing CIFAR-10 accuracy with ResNet",
+        paper_method="Train on CIFAR dataset with GPU",
+        experiment_goal="Match the published benchmark accuracy.",
+    )
+    action = build_baseline_scientist_action(obs)
+
+    assert action.action_type is ScientistActionType.PROPOSE_PROTOCOL
+    assert action.technique == "published_split_replication"
+
+
+def test_baseline_scientist_infers_finance_domain() -> None:
+    obs = _base_observation(
+        paper_title="Offline backtest of SPY mean-reversion",
+        paper_method="Daily bar backtest with slippage modeling",
+        experiment_goal="Evaluate Sharpe ratio under drawdown limits.",
+    )
+    action = build_baseline_scientist_action(obs)
+
+    assert action.action_type is ScientistActionType.PROPOSE_PROTOCOL
+    assert action.technique == "offline_backtest_workflow"
+
+
+def test_baseline_scientist_infers_math_domain() -> None:
+    obs = _base_observation(
+        paper_title="Planning a proof of AM-GM inequality",
+        paper_method="Algebraic manipulation with induction.",
+        experiment_goal="Verify the proof outline.",
+    )
+    action = build_baseline_scientist_action(obs)
+
+    assert action.action_type is ScientistActionType.PROPOSE_PROTOCOL
+    assert action.technique == "structured_proof_outline"
+
+
+# --- Baseline: forced accept at final round ---
+
+
+def test_baseline_scientist_accepts_at_final_round_even_with_blocker() -> None:
+    obs = _base_observation(
+        current_protocol=Protocol(
+            sample_size=20,
+            controls=["ctrl"],
+            technique="method_a",
+            duration_days=5,
+            required_equipment=[],
+            required_reagents=[],
+            rationale="Full scope plan.",
+        ),
+        conversation_history=[
+            ConversationEntry(
+                role="lab_manager",
+                message="Budget is tight and equipment is booked.",
+                round_number=4,
+                action_type="suggest_alternative",
+            ),
+        ],
+        round_number=5,
+        max_rounds=6,
+    )
+
+    action = build_baseline_scientist_action(obs)
+
+    assert action.action_type is ScientistActionType.ACCEPT
tests/test_server.py ADDED
@@ -0,0 +1,604 @@
1
+ """Server endpoint tests.
2
+
3
+ API 02 adds POST /reset endpoint tests.
4
+ API 04 adds a smoke test for GET /scenarios.
5
+ API 13 adds CORS middleware verification tests.
6
+ API 03 adds POST /step endpoint tests.
7
+ API 06 adds WebSocket session handler tests.
8
+ API 07 adds idle-timeout and graceful disconnect cleanup tests.
9
+ """
10
+
11
+ from __future__ import annotations
12
+
13
+ import json
14
+ import time
15
+ from unittest.mock import patch
16
+
17
+ import pytest
18
+ from fastapi.testclient import TestClient
19
+ from starlette.websockets import WebSocketDisconnect
20
+
21
+ from server.app import app
22
+
23
+ _EXPECTED_FAMILIES = {"math_reasoning", "ml_benchmark", "finance_trading"}
24
+ _EXPECTED_DIFFICULTIES = ["easy", "medium", "hard"]
25
+
26
+
27
+ @pytest.fixture()
28
+ def client():
29
+ return TestClient(app)
30
+
31
+
32
+ class TestScenariosEndpoint:
33
+ """GET /scenarios — API 04."""
34
+
35
+ def test_returns_200(self, client: TestClient):
36
+ resp = client.get("/scenarios")
37
+ assert resp.status_code == 200
38
+
39
+ def test_response_has_scenarios_key(self, client: TestClient):
40
+ data = client.get("/scenarios").json()
41
+ assert "scenarios" in data
42
+ assert isinstance(data["scenarios"], list)
43
+
44
+ def test_all_families_present(self, client: TestClient):
45
+ data = client.get("/scenarios").json()
46
+ families = {s["family"] for s in data["scenarios"]}
47
+ assert families == _EXPECTED_FAMILIES
48
+
49
+ def test_each_family_has_difficulties(self, client: TestClient):
50
+ data = client.get("/scenarios").json()
51
+ for entry in data["scenarios"]:
52
+ assert entry["difficulties"] == _EXPECTED_DIFFICULTIES
53
+
54
+ def test_no_extra_keys(self, client: TestClient):
55
+ data = client.get("/scenarios").json()
56
+ for entry in data["scenarios"]:
57
+ assert set(entry.keys()) == {"family", "difficulties"}
58
+
59
+
60
+ # ---------------------------------------------------------------------------
61
+ # POST /reset — API 02
62
+ # ---------------------------------------------------------------------------
63
+
64
+
65
+ class TestCorsConfiguration:
66
+ """API 13: CORS middleware for local frontend and HF Spaces."""
67
+
68
+ def test_preflight_allows_localhost_vite_origin(self, client: TestClient) -> None:
69
+ resp = client.options(
70
+ "/reset",
71
+ headers={
72
+ "Origin": "http://localhost:5173",
73
+ "Access-Control-Request-Method": "POST",
74
+ },
75
+ )
76
+
77
+ assert resp.status_code == 200
78
+ assert resp.headers["access-control-allow-origin"] == "http://localhost:5173"
79
+ assert resp.headers["access-control-allow-credentials"] == "true"
80
+
81
+ def test_preflight_allows_hf_space_origin(self, client: TestClient) -> None:
82
+ origin = "https://replicalab-demo.hf.space"
83
+ resp = client.options(
84
+ "/health",
85
+ headers={
86
+ "Origin": origin,
87
+ "Access-Control-Request-Method": "GET",
88
+ },
89
+ )
90
+
91
+ assert resp.status_code == 200
92
+ assert resp.headers["access-control-allow-origin"] == origin
93
+ assert resp.headers["access-control-allow-credentials"] == "true"
94
+
95
+ def test_preflight_rejects_unconfigured_origin(self, client: TestClient) -> None:
96
+ resp = client.options(
97
+ "/reset",
98
+ headers={
99
+ "Origin": "https://evil.example.com",
100
+ "Access-Control-Request-Method": "POST",
101
+ },
102
+ )
103
+
104
+ assert resp.status_code == 400
105
+ assert "access-control-allow-origin" not in resp.headers
106
+
107
+
108
+ class TestResetEndpoint:
109
+ """POST /reset — API 02."""
110
+
111
+ def test_reset_returns_200_with_expected_keys(self, client: TestClient) -> None:
112
+ resp = client.post("/reset", json={"seed": 1})
113
+ assert resp.status_code == 200
114
+ data = resp.json()
115
+ assert "session_id" in data
116
+ assert "episode_id" in data
117
+ assert "observation" in data
118
+
119
+ def test_reset_observation_has_both_roles(self, client: TestClient) -> None:
120
+ data = client.post("/reset", json={"seed": 1}).json()
121
+ obs = data["observation"]
122
+ assert "scientist" in obs
123
+ assert "lab_manager" in obs
124
+ assert obs["scientist"]["paper_title"]
125
+ assert obs["lab_manager"]["budget_total"] > 0
126
+
127
+ def test_reset_with_explicit_session_id_reuses_slot(
128
+ self, client: TestClient
129
+ ) -> None:
130
+ """Passing session_id reuses the same slot and returns the same id."""
131
+ sid = "my-fixed-session"
132
+ d1 = client.post("/reset", json={"seed": 1, "session_id": sid}).json()
133
+ assert d1["session_id"] == sid
134
+
135
+ d2 = client.post("/reset", json={"seed": 2, "session_id": sid}).json()
136
+ assert d2["session_id"] == sid
137
+ # New episode each time
138
+ assert d2["episode_id"] != d1["episode_id"]
139
+
140
+ def test_reset_reuse_closes_prior_env(self, client: TestClient) -> None:
141
+ """Resetting with the same session_id produces a fresh episode."""
142
+ sid = "reuse-session"
143
+ d1 = client.post("/reset", json={"seed": 10, "session_id": sid}).json()
144
+ ep1 = d1["episode_id"]
145
+
146
+ d2 = client.post("/reset", json={"seed": 20, "session_id": sid}).json()
147
+ ep2 = d2["episode_id"]
148
+
149
+ assert ep1 != ep2
150
+
151
+ def test_reset_default_params(self, client: TestClient) -> None:
152
+ """Omitting scenario and difficulty uses defaults without error."""
153
+ resp = client.post("/reset", json={"seed": 0})
154
+ assert resp.status_code == 200
155
+ data = resp.json()
156
+ assert data["observation"]["scientist"]["paper_title"]
157
+
158
+ def test_reset_custom_scenario_and_difficulty(self, client: TestClient) -> None:
159
+ for family in ("math_reasoning", "ml_benchmark", "finance_trading"):
160
+ for diff in ("easy", "medium", "hard"):
161
+ resp = client.post(
162
+ "/reset",
163
+ json={"seed": 42, "scenario": family, "difficulty": diff},
164
+ )
165
+ assert resp.status_code == 200, f"Failed for {family}/{diff}"
166
+ obs = resp.json()["observation"]
167
+ assert obs["scientist"]["paper_title"]
168
+ assert obs["lab_manager"]["budget_total"] > 0
169
+
170
+ def test_reset_deterministic_with_same_seed(self, client: TestClient) -> None:
171
+ """Same seed + scenario + difficulty → identical observations."""
172
+ params = {"seed": 99, "scenario": "math_reasoning", "difficulty": "medium"}
173
+ d1 = client.post("/reset", json=params).json()
174
+ d2 = client.post("/reset", json=params).json()
175
+
176
+ assert d1["observation"] == d2["observation"]
177
+ # Episode ids differ (new UUID each time)
178
+ assert d1["episode_id"] != d2["episode_id"]
179
+
180
+
181
+ # ---------------------------------------------------------------------------
182
+ # Helpers
183
+ # ---------------------------------------------------------------------------
184
+
185
+
186
+ def _reset(client: TestClient, **kwargs) -> dict:
187
+ """Reset and return the response JSON."""
188
+ payload = {"seed": 42, "scenario": "math_reasoning", "difficulty": "easy"}
189
+ payload.update(kwargs)
190
+ resp = client.post("/reset", json=payload)
191
+ assert resp.status_code == 200
192
+ return resp.json()
193
+
194
+
195
+ def _good_action_payload(client: TestClient) -> dict:
196
+ """Build a valid propose_protocol action payload from a fresh scenario."""
197
+ from replicalab.scenarios import generate_scenario
198
+
199
+ scenario = generate_scenario(seed=42, template="math_reasoning", difficulty="easy")
200
+ lab = scenario.lab_manager_observation
201
+ spec = scenario.hidden_reference_spec
202
+ return {
203
+ "action_type": "propose_protocol",
204
+ "sample_size": 10,
205
+ "controls": ["baseline", "ablation"],
206
+ "technique": spec.summary[:60] if spec.summary else "replication_plan",
207
+ "duration_days": max(1, min(2, lab.time_limit_days)),
208
+ "required_equipment": (
209
+ list(lab.equipment_available[:1]) if lab.equipment_available else []
210
+ ),
211
+ "required_reagents": (
212
+ list(lab.reagents_in_stock[:1]) if lab.reagents_in_stock else []
213
+ ),
214
+ "questions": [],
215
+ "rationale": (
216
+ f"Plan addresses: {', '.join(spec.required_elements[:2])}. "
217
+ f"Target metric: {spec.target_metric}. "
218
+ f"Target value: {spec.target_value}. "
219
+ "Stay within budget and schedule."
220
+ ),
221
+ }
222
+
223
+
224
+ def _accept_action_payload() -> dict:
225
+ return {
226
+ "action_type": "accept",
227
+ "sample_size": 0,
228
+ "controls": [],
229
+ "technique": "",
230
+ "duration_days": 0,
231
+ "required_equipment": [],
232
+ "required_reagents": [],
233
+ "questions": [],
234
+ "rationale": "",
235
+ }
236
+
237
+
238
+ # ---------------------------------------------------------------------------
239
+ # POST /step — API 03
240
+ # ---------------------------------------------------------------------------
241
+
242
+
243
+ class TestStepEndpoint:
+     """POST /step — API 03."""
+
+     def test_reset_then_step_happy_path(self, client: TestClient) -> None:
+         """Reset, then step with a valid action → 200 with StepResult."""
+         reset_data = _reset(client)
+         session_id = reset_data["session_id"]
+
+         action = _good_action_payload(client)
+         resp = client.post("/step", json={"session_id": session_id, "action": action})
+
+         assert resp.status_code == 200
+         data = resp.json()
+         assert "observation" in data
+         assert "reward" in data
+         assert "done" in data
+         assert "info" in data
+         assert data["done"] is False
+         assert data["info"]["error"] is None
+
+     def test_step_invalid_session_returns_404(self, client: TestClient) -> None:
+         """Step with a non-existent session_id → 404."""
+         action = _good_action_payload(client)
+         resp = client.post(
+             "/step",
+             json={"session_id": "nonexistent-session-id", "action": action},
+         )
+
+         assert resp.status_code == 404
+         assert "Session not found" in resp.json()["detail"]
+
+     def test_terminal_step_returns_real_reward_breakdown(
+         self, client: TestClient
+     ) -> None:
+         """Propose → accept: terminal step has real reward_breakdown,
+         judge_notes, and verdict from the env (not stubs)."""
+         reset_data = _reset(client)
+         session_id = reset_data["session_id"]
+
+         # Step 1: propose
+         action = _good_action_payload(client)
+         resp1 = client.post("/step", json={"session_id": session_id, "action": action})
+         assert resp1.status_code == 200
+         assert resp1.json()["done"] is False
+
+         # Step 2: accept
+         resp2 = client.post(
+             "/step",
+             json={"session_id": session_id, "action": _accept_action_payload()},
+         )
+         assert resp2.status_code == 200
+         data = resp2.json()
+
+         assert data["done"] is True
+         assert data["reward"] > 0.0
+
+         info = data["info"]
+         assert info["agreement_reached"] is True
+         assert info["verdict"] == "accept"
+         assert info["judge_notes"] is not None
+         assert "rigor" in info["judge_notes"]
+
+         rb = info["reward_breakdown"]
+         assert rb is not None
+         assert 0.0 <= rb["rigor"] <= 1.0
+         assert 0.0 <= rb["feasibility"] <= 1.0
+         assert 0.0 <= rb["fidelity"] <= 1.0
+         # Verify it's not the old stub 0.8
+         assert not (rb["rigor"] == 0.8 and rb["feasibility"] == 0.8 and rb["fidelity"] == 0.8)
+
313
+     def test_semantic_invalid_action_returns_200_with_error(
+         self, client: TestClient
+     ) -> None:
+         """A semantically invalid action (e.g. duration=999) returns 200
+         with info.error set, not a crash or 422."""
+         reset_data = _reset(client)
+         session_id = reset_data["session_id"]
+
+         bad_action = {
+             "action_type": "propose_protocol",
+             "sample_size": 5,
+             "controls": ["baseline"],
+             "technique": "some technique",
+             "duration_days": 999,
+             "required_equipment": [],
+             "required_reagents": [],
+             "questions": [],
+             "rationale": "Duration is impossibly long for the lab time limit.",
+         }
+         resp = client.post(
+             "/step", json={"session_id": session_id, "action": bad_action}
+         )
+
+         assert resp.status_code == 200
+         data = resp.json()
+         assert data["done"] is False
+         assert data["info"]["error"] is not None
+         assert "Validation errors" in data["info"]["error"]
+
+     def test_replay_uses_real_judge_data(self, client: TestClient) -> None:
+         """After a terminal step, GET /replay/{episode_id} returns
+         real judge_notes and verdict, not stub values."""
+         reset_data = _reset(client)
+         session_id = reset_data["session_id"]
+         episode_id = reset_data["episode_id"]
+
+         # Propose then accept
+         action = _good_action_payload(client)
+         client.post("/step", json={"session_id": session_id, "action": action})
+         client.post(
+             "/step",
+             json={"session_id": session_id, "action": _accept_action_payload()},
+         )
+
+         # Fetch replay
+         resp = client.get(f"/replay/{episode_id}")
+         assert resp.status_code == 200
+         replay = resp.json()
+
+         assert replay["agreement_reached"] is True
+         assert "rigor" in replay["judge_notes"]
+         assert replay["verdict"] == "accept"
+         assert replay["reward_breakdown"] is not None
+         assert replay["total_reward"] > 0.0
+         # Not the old stub string
+         assert "Stub audit" not in replay["judge_notes"]
+
+
371
+ # ---------------------------------------------------------------------------
+ # WebSocket handler — API 06
+ # ---------------------------------------------------------------------------
+
+
+ def _ws_send_recv(ws, msg: dict) -> dict:
+     """Send a JSON message over the WebSocket and return the parsed response."""
+     ws.send_text(json.dumps(msg))
+     return json.loads(ws.receive_text())
+
+
+ class TestWebSocket:
+     """API 06: WebSocket session handler with isolated env per connection."""
+
+     # -- basic connectivity --------------------------------------------------
+
+     def test_ws_ping_pong(self, client: TestClient) -> None:
+         with client.websocket_connect("/ws") as ws:
+             resp = _ws_send_recv(ws, {"type": "ping"})
+             assert resp["type"] == "pong"
+
+     def test_ws_reset_returns_observation(self, client: TestClient) -> None:
+         with client.websocket_connect("/ws") as ws:
+             resp = _ws_send_recv(ws, {
+                 "type": "reset", "seed": 42,
+                 "scenario": "math_reasoning", "difficulty": "easy",
+             })
+             assert resp["type"] == "reset_ok"
+             assert resp["episode_id"]
+             obs = resp["observation"]
+             assert obs["scientist"]["paper_title"]
+             assert obs["lab_manager"]["budget_total"] > 0
+
+     def test_ws_step_returns_result(self, client: TestClient) -> None:
+         action = _good_action_payload(client)
+         with client.websocket_connect("/ws") as ws:
+             _ws_send_recv(ws, {"type": "reset", "seed": 42})
+             resp = _ws_send_recv(ws, {"type": "step", "action": action})
+
+         assert resp["type"] == "step_ok"
+         assert resp["done"] is False
+         assert resp["reward"] == 0.0
+         assert resp["observation"] is not None
+
415
+     def test_ws_full_episode_real_reward(self, client: TestClient) -> None:
+         """Propose → accept returns real reward breakdown, not stub 0.8."""
+         action = _good_action_payload(client)
+         with client.websocket_connect("/ws") as ws:
+             _ws_send_recv(ws, {"type": "reset", "seed": 42})
+             _ws_send_recv(ws, {"type": "step", "action": action})
+             resp = _ws_send_recv(ws, {"type": "step", "action": _accept_action_payload()})
+
+         assert resp["type"] == "step_ok"
+         assert resp["done"] is True
+         assert resp["reward"] > 0.0
+
+         info = resp["info"]
+         assert info["agreement_reached"] is True
+         assert info["verdict"] == "accept"
+         rb = info["reward_breakdown"]
+         assert rb is not None
+         assert 0.0 <= rb["rigor"] <= 1.0
+         assert 0.0 <= rb["feasibility"] <= 1.0
+         assert 0.0 <= rb["fidelity"] <= 1.0
+         assert not (rb["rigor"] == 0.8 and rb["feasibility"] == 0.8)
+
+     # -- error handling ------------------------------------------------------
+
+     def test_ws_invalid_json(self, client: TestClient) -> None:
+         with client.websocket_connect("/ws") as ws:
+             ws.send_text("not valid json {{{")
+             resp = json.loads(ws.receive_text())
+             assert resp["type"] == "error"
+             assert "Invalid JSON" in resp["message"]
+
+     def test_ws_missing_action_field(self, client: TestClient) -> None:
+         with client.websocket_connect("/ws") as ws:
+             _ws_send_recv(ws, {"type": "reset", "seed": 42})
+             resp = _ws_send_recv(ws, {"type": "step"})
+             assert resp["type"] == "error"
+             assert "Missing" in resp["message"]
+
+     def test_ws_invalid_action_payload(self, client: TestClient) -> None:
+         """Structurally invalid action (missing required fields) → WS error."""
+         with client.websocket_connect("/ws") as ws:
+             _ws_send_recv(ws, {"type": "reset", "seed": 42})
+             resp = _ws_send_recv(ws, {
+                 "type": "step",
+                 "action": {"action_type": "propose_protocol"},
+             })
+             assert resp["type"] == "error"
+             assert "Invalid action" in resp["message"]
+
+     def test_ws_unknown_message_type(self, client: TestClient) -> None:
+         with client.websocket_connect("/ws") as ws:
+             resp = _ws_send_recv(ws, {"type": "banana"})
+             assert resp["type"] == "error"
+             assert "Unknown" in resp["message"]
+
+     # -- session isolation ---------------------------------------------------
+
+     def test_ws_session_isolation(self, client: TestClient) -> None:
+         """Two WebSocket connections have independent env state."""
+         action = _good_action_payload(client)
+
+         with client.websocket_connect("/ws") as ws1:
+             r1 = _ws_send_recv(ws1, {"type": "reset", "seed": 1})
+             _ws_send_recv(ws1, {"type": "step", "action": action})
+
+             with client.websocket_connect("/ws") as ws2:
+                 r2 = _ws_send_recv(ws2, {"type": "reset", "seed": 2})
+
+                 assert r1["episode_id"] != r2["episode_id"]
+                 # ws2 is at round 0, ws1 is at round 1
+                 step2 = _ws_send_recv(ws2, {"type": "step", "action": action})
+                 assert step2["observation"]["scientist"]["round_number"] == 1
+
488
+     # -- real-env integration (user-requested) --------------------------------
+
+     def test_ws_semantic_invalid_action_returns_step_ok_with_info_error(
+         self, client: TestClient
+     ) -> None:
+         """A structurally valid but semantically invalid action (e.g.
+         duration_days=999) returns step_ok with info.error — NOT a
+         transport-level WS error frame."""
+         with client.websocket_connect("/ws") as ws:
+             _ws_send_recv(ws, {"type": "reset", "seed": 42})
+             bad_action = {
+                 "action_type": "propose_protocol",
+                 "sample_size": 5,
+                 "controls": ["baseline"],
+                 "technique": "some technique",
+                 "duration_days": 999,
+                 "required_equipment": [],
+                 "required_reagents": [],
+                 "questions": [],
+                 "rationale": "Duration is impossibly long for the lab.",
+             }
+             resp = _ws_send_recv(ws, {"type": "step", "action": bad_action})
+
+         assert resp["type"] == "step_ok"
+         assert resp["done"] is False
+         assert resp["info"]["error"] is not None
+         assert "Validation errors" in resp["info"]["error"]
+
+     def test_ws_timeout_verdict(self, client: TestClient) -> None:
+         """Run to max_rounds without accept → done=True, verdict=timeout,
+         reward=0.0. Proves real-env integration."""
+         action = _good_action_payload(client)
+         with client.websocket_connect("/ws") as ws:
+             reset_resp = _ws_send_recv(ws, {"type": "reset", "seed": 42})
+             max_rounds = reset_resp["observation"]["scientist"]["max_rounds"]
+
+             resp = None
+             for _ in range(max_rounds):
+                 resp = _ws_send_recv(ws, {"type": "step", "action": action})
+
+         assert resp["done"] is True
+         assert resp["info"]["verdict"] == "timeout"
+         assert resp["reward"] == 0.0
+         assert resp["info"]["reward_breakdown"] is not None
+
533
+     def test_ws_terminal_episode_persists_real_replay_log(
+         self, client: TestClient
+     ) -> None:
+         """Complete a WS episode, then verify GET /replay/{episode_id}
+         returns real reward_breakdown, judge_notes, and verdict —
+         not stub strings."""
+         action = _good_action_payload(client)
+         with client.websocket_connect("/ws") as ws:
+             reset_resp = _ws_send_recv(ws, {"type": "reset", "seed": 42})
+             episode_id = reset_resp["episode_id"]
+
+             _ws_send_recv(ws, {"type": "step", "action": action})
+             _ws_send_recv(ws, {"type": "step", "action": _accept_action_payload()})
+
+         # Fetch replay via REST after WS connection is closed
+         replay_resp = client.get(f"/replay/{episode_id}")
+         assert replay_resp.status_code == 200
+         replay = replay_resp.json()
+
+         assert replay["agreement_reached"] is True
+         assert replay["verdict"] == "accept"
+         assert replay["total_reward"] > 0.0
+
+         # Real judge_notes, not stub
+         assert replay["judge_notes"] != ""
+         assert "Stub audit" not in replay["judge_notes"]
+         assert "rigor" in replay["judge_notes"]
+
+         # Real reward_breakdown with non-stub scores
+         rb = replay["reward_breakdown"]
+         assert rb is not None
+         assert 0.0 < rb["rigor"] <= 1.0
+         assert 0.0 < rb["feasibility"] <= 1.0
+         assert 0.0 < rb["fidelity"] <= 1.0
+         assert not (rb["rigor"] == 0.8 and rb["feasibility"] == 0.8)
+
569
+     # -- idle timeout & disconnect cleanup (API 07) -------------------------
+
+     def test_ws_idle_timeout_closes_connection(self, client: TestClient) -> None:
+         """API 07: server closes WebSocket after idle timeout (no messages)."""
+         with patch("server.app._WS_IDLE_TIMEOUT", 0.5):
+             with client.websocket_connect("/ws") as ws:
+                 # Don't send anything — let the server-side timeout fire
+                 time.sleep(1.0)
+                 with pytest.raises(WebSocketDisconnect) as exc_info:
+                     ws.receive_text()
+                 assert exc_info.value.code == 1000
+
+     def test_ws_env_closes_on_disconnect(self, client: TestClient) -> None:
+         """API 07: env.close() runs in the finally block on disconnect."""
+         import server.app as _app
+
+         _original_make_env = _app._make_env
+         close_called: list[bool] = []
+
+         def _tracked_make_env():
+             env = _original_make_env()
+             _original_close = env.close
+
+             def _tracking_close():
+                 close_called.append(True)
+                 _original_close()
+
+             env.close = _tracking_close
+             return env
+
+         with patch.object(_app, "_make_env", _tracked_make_env):
+             with client.websocket_connect("/ws") as ws:
+                 _ws_send_recv(ws, {"type": "ping"})
+         # Context manager exit sends disconnect; server runs finally block
+         # TestClient joins the ASGI thread, so close() has already run
+         assert len(close_called) == 1
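The close-tracking pattern used in the final test can be exercised in isolation. The sketch below is illustrative only: `FakeEnv` and `make_tracked_env` are hypothetical stand-ins, not names from this repo.

```python
class FakeEnv:
    """Hypothetical stand-in for the real environment object."""

    def close(self) -> None:
        """Pretend to release resources; the real env would clean up here."""


close_called: list[bool] = []


def make_tracked_env() -> FakeEnv:
    """Wrap close() so a test can assert it was invoked exactly once."""
    env = FakeEnv()
    original_close = env.close  # capture the bound method first

    def tracking_close() -> None:
        close_called.append(True)
        original_close()  # still run the real cleanup

    env.close = tracking_close  # instance-level override; the class is untouched
    return env


env = make_tracked_env()
env.close()
print(close_called)  # → [True]
```

Because the override lives on the instance, other envs created from the untouched class keep their original `close()`, which is one reason the test patches the factory rather than the class.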