
# ReplicaLab Task Completion Tracker

Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`


## Working Governance Files

| File | Role |
| --- | --- |
| `AGENTS.md` | Required startup and close-out rules for contributors and automated model agents |
| `docs/project_management_rules.md` | Detailed project-management workflow |
| `docs/changes.md` | Append-only deviation log |
| `docs/<owner>/` | Owner-local task and planning docs |

## Overall Completion

| Metric | Value |
| --- | --- |
| Total tasks | 152 |
| Completed | 152 |
| Partial / active | 0 |
| Remaining | 0 |
| Completion rate | 100.00% |

Post-MVP benchmark note:

- On 2026-03-09, a live Northflank H100 first-step benchmark was added as an operational post-MVP artifact under `replicalab/outputs/training/h100-one-step-500-20260309/`.
- It covers 500 total simulations (250 shared reset cases × baseline and trained first-step actions) and records paper-understanding regression data for the current saved Scientist adapter.

## Completion by Person

| Person | Assigned | Completed (own) | Completed (by others) | Remaining | Rate |
| --- | --- | --- | --- | --- | --- |
| Kian (Person A) | 49 (47 solo + 2 shared with B) | 1 shared sign-off (FND 08) | 48 (FND 04, FND 09, MOD 01, MOD 02, MOD 03, MOD 04, MOD 05, MOD 06, MOD 08, MOD 11, MOD 12, SCN 01 to SCN 10, SCN 13, AGT 05, AGT 09, ENV 01 to ENV 08, ENV 10, ENV 11, JDG 01 to JDG 06, JDG 08, JDG 11, OBS 04, TST 01 to TST 05; done by Person B) | 0 | 100.00% |
| Person B (Ayush) | 29 (27 solo + 2 shared with A) | 29 (FND 08, MOD 09, SCN 11, AGT 01, AGT 02, AGT 03, AGT 04, AGT 05, AGT 06, AGT 07, AGT 08, AGT 10, AGT 11, JDG 10, TRN 01 to TRN 10, TRN 13, TRN 14, TRN 15, OBS 06, TST 09) | 0 | 0 | 100.00% |
| Max (Person C) | 41 | 1 (FND 11) | 40 (done by Person B or Person D; API 16, UI 11 by Kush) | 0 | 100.00% |
| Kush (Person D) | 32 | 17 (FND 13, UI 01-UI 06, UI 07-UI 09, UI 10, UI 11, UI 13-UI 15, JDG 09, OBS 05) | 15 (by Person B: FND 06, SCN 12, API 12, TRN 12, UI 12, OBS 08, TST 08, TST 12, DOC 01-DOC 07, DOC 09, DOC 11) | 0 | 100.00% |
| All (shared) | 3 | 3 (FND 08, AGT 05, TST 10) | 0 | 0 | 100.00% |

All 152 tasks are now complete (100%). Every person's lane is closed:

- Kian (Person A): 49/49 (48 done by Person B)
- Ayush (Person B): 29/29
- Max (Person C): 41/41 (done by Person B and Kush)
- Kush (Person D): 32/32 (17 by Kush, 15 by Person B)
- Shared: 3/3

## Active Partial Tasks

| ID | Assigned To | Current Status | Remaining Acceptance Item |
| --- | --- | --- | --- |
| — | — | No active partial tasks | — |

## Completed Tasks

### Person B (Ayush) - Completed on behalf of others

Each entry below lists: ID, Epic, Assigned To, Task, File/Module, Date, What Was Done, Acceptance Criteria, Verified.
FND 01 E01 Person C Create repo structure and base folders from agreed layout repo root 2026-03-07 Created the full repo scaffold: replicalab/ with subdirectories for agents/, env/, prompts/, scenarios/, scoring/, utils/; server/; frontend/ with src/components/ and src/pages/; notebooks/; tests/. All directories tracked via .gitkeep files. All top level folders exist and repo clones cleanly Yes
FND 02 E01 Person C Add Python project config and dependencies placeholder pyproject.toml 2026-03-08 Added a PEP 621 pyproject.toml with package metadata, Python 3.10+ requirement, runtime dependencies (pydantic, fastapi, uvicorn, websockets), dev extras (pytest, pytest-cov, ruff, mypy), package discovery, and pytest test-path settings. Project installs locally without missing package errors for base modules Yes - verified with python -m pip install -e ., python -m pip install -e ".[dev]", and python -c "from replicalab.models import ..."
FND 04 E01 Person A Add empty Pydantic models and shared type names replicalab/models.py 2026-03-08 Created replicalab/__init__.py and replicalab/models.py with the shared action, observation, step, state, and log stubs. Import paths resolve for all placeholder models Yes - verified with python -c "from replicalab.models import ..."
FND 05 E01 Person C Add ignore rules for Python, Node, logs, notebooks, and build artifacts .gitignore, .dockerignore 2026-03-08 Added .dockerignore and expanded .gitignore for caches, coverage artifacts, notebook checkpoints, frontend build files, and generated outputs while preserving tracked .gitkeep files. Repo status stays clean after local run and build, and Docker build excludes non-runtime files Yes
FND 06 E01 Person D Add temporary project stub with title, mission, team roles, and local setup placeholder README.md 2026-03-08 Replaced the aspirational README with a temporary foundation stub that reflects the current repo state, mission, ownership, and verified setup placeholder. New contributor can understand repo purpose in under two minutes Yes
FND 07 E01 Person C Define branch naming, PR template, and issue template .github/ and repo workflow docs 2026-03-08 Added .github/pull_request_template.md and .github/ISSUE_TEMPLATE/task.yml, and documented preferred branch naming patterns plus required tracking-doc updates in docs/project_management_rules.md. All future PRs auto show the template and issue fields Yes
FND 09 E01 Person A Create OpenEnv configuration file specifying environment class, action and observation types, and server settings openenv.yaml, pyproject.toml, server/app.py, uv.lock 2026-03-08 Added openenv.yaml, recorded the environment and contract metadata for OpenEnv, added openenv-core plus a server script entry point to pyproject.toml, added main() to server/app.py, and generated uv.lock so the repo passes local OpenEnv validation. OpenEnv can discover and serve the environment using this config file Yes - verified with uv lock and openenv validate
FND 10 E01 Person C Create output directory structure replicalab/outputs/ 2026-03-07 Created replicalab/outputs/ with three subdirectories: logs/, replays/, and plots/, all tracked via .gitkeep files. Output directories exist and generated files are not committed to git Yes
MOD 01 E02 Person A Implement ScientistAction schema replicalab/models.py, tests/test_models.py, server/app.py 2026-03-08 Replaced the ScientistAction stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, rejected mixed-mode payloads, added conditional validation for proposal, revision, request-info, and accept modes, and patched the stub server so accept preserves the current protocol. Valid scientist actions parse and invalid fields raise validation errors Yes - verified with python -m pytest tests/test_models.py and a stub-env ScientistAction.model_validate(...) smoke step
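The mixed-mode rejection described in this entry can be sketched with a plain-Python check. This is a minimal illustration of the rule, not the project's actual Pydantic code; the action-type values and the payload keys (`protocol`, `rationale`, `question`) are assumptions:

```python
# Sketch: mode-conditional action validation with unknown-key rejection.
from enum import Enum


class ActionType(str, Enum):
    PROPOSE = "propose"
    REVISE = "revise"
    REQUEST_INFO = "request_info"
    ACCEPT = "accept"


# Which payload keys each mode may carry (illustrative field names).
ALLOWED_FIELDS = {
    ActionType.PROPOSE: {"protocol", "rationale"},
    ActionType.REVISE: {"protocol", "rationale"},
    ActionType.REQUEST_INFO: {"question"},
    ActionType.ACCEPT: set(),
}


def validate_action(action_type: ActionType, payload: dict) -> None:
    """Reject unknown keys and fields that belong to a different mode."""
    extra = set(payload) - ALLOWED_FIELDS[action_type]
    if extra:
        raise ValueError(
            f"fields not allowed for {action_type.value}: {sorted(extra)}"
        )
```

In the real schema the same effect comes from strict model configuration plus conditional validators; the point here is only that each mode whitelists its own fields and everything else fails fast.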
MOD 02 E02 Person A Implement LabManagerAction schema replicalab/models.py, tests/test_models.py 2026-03-08 Replaced the LabManagerAction stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, enforced feasible-flag consistency, rejected suggestion fields outside suggest_alternative, and added focused validation tests. Valid lab manager actions parse and invalid fields raise validation errors Yes - verified with python -m pytest tests/test_models.py
MOD 03 E02 Person A Implement role specific observation models replicalab/models.py, tests/test_models.py, server/app.py 2026-03-08 Added typed ConversationEntry and Protocol models, upgraded both observation branches to use typed nested structures with non-negative numeric constraints and stable keys, and verified dict-to-model coercion through the stub server. Scientist and lab observations serialize to JSON with stable keys Yes - verified with python -m pytest tests/test_models.py and a stub reset() / step() JSON smoke test
MOD 04 E02 Person A Implement EpisodeState and EpisodeLog models replicalab/models.py, server/app.py, tests/test_models.py 2026-03-08 Replaced the remaining loose dict state and replay fields with typed Protocol, ConversationEntry, and RewardBreakdown models, updated the stub runtime to construct those nested models explicitly, and added round-trip coverage for serialized state and logs. Full state round trip serialize plus deserialize works Yes - verified with python -m pytest tests/test_models.py
MOD 05 E02 Person A Add protocol validation for sample size, controls, duration, equipment vocab, and reagent vocab replicalab/utils/validation.py, tests/test_models.py, tests/test_scenarios.py 2026-03-08 Added deterministic semantic protocol validation with ValidationResult and validate_protocol(...) checks for resource vocabulary, allowed substitutions, duration limits, required-element coverage, and obvious impossibilities against the normalized scenario pack. Invalid protocol examples are rejected with readable reasons Yes - verified with python -m pytest tests/test_models.py tests/test_scenarios.py
MOD 06 E02 Person A Add semantic validators for impossible plans such as zero sample size with positive controls replicalab/utils/validation.py, tests/test_validation.py 2026-03-08 Added _check_semantic_impossibilities() with five checks: zero sample with controls (error), controls >= sample size (error), duplicate controls (warning), duplicate equipment (warning), duplicate reagents (warning). Seven new tests cover all cases plus a regression guard confirming valid protocols still pass. Semantic validator catches at least five invalid edge cases Yes - verified with python -m pytest tests/test_validation.py (20 tests pass) and full suite (223 passed)
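The five checks listed in this entry can be illustrated with a small stand-alone function. This is a sketch under an assumed protocol shape (`sample_size`, `controls`, `equipment`, `reagents` as dict keys), not the actual `_check_semantic_impossibilities()` implementation:

```python
# Sketch of the five semantic-impossibility checks described above.
def check_semantic_impossibilities(protocol: dict) -> list[tuple[str, str]]:
    """Return (severity, message) findings; 'error' blocks, 'warning' does not."""
    findings = []
    sample_size = protocol.get("sample_size", 0)
    controls = protocol.get("controls", [])
    # 1. Zero sample size with positive controls is contradictory.
    if sample_size == 0 and controls:
        findings.append(("error", "zero sample size with positive controls"))
    # 2. Controls meeting or exceeding the sample size leave no treatment arm.
    if controls and sample_size > 0 and len(controls) >= sample_size:
        findings.append(("error", "controls meet or exceed sample size"))
    # 3-5. Duplicate controls / equipment / reagents are suspicious but survivable.
    for key in ("controls", "equipment", "reagents"):
        items = protocol.get(key, [])
        if len(items) != len(set(items)):
            findings.append(("warning", f"duplicate {key}"))
    return findings
```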
MOD 07 E02 Person C Add state serialization helper for replay logs replicalab/utils/logging.py, tests/test_logging.py 2026-03-08 Added file-based replay persistence helpers with atomic JSON writes (write_episode_log, load_episode_log) plus CSV reward logging (append_reward_csv). Eleven tests cover lossless round-trip, filename behavior, nested directory creation, transcript and reward-breakdown preservation, CSV headers, append semantics, missing-file errors, and default output targets. State logs can be written and loaded without loss Yes - verified with python -m pytest tests/test_logging.py (11 tests pass)
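An atomic JSON write of the kind this entry describes is usually done by writing to a temporary file in the same directory and renaming it into place. A minimal sketch of the pattern (the real helpers live in `replicalab/utils/logging.py`; this is not their actual code):

```python
import json
import os
import tempfile


def write_json_atomic(path: str, payload: dict) -> None:
    """Write JSON to a temp file in the target directory, then rename into place.

    os.replace() is atomic on POSIX, so a reader never observes a half-written
    replay log even if the process dies mid-write.
    """
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_path, path)  # atomic swap into the final filename
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on any failure
        raise
```

The temp file must live on the same filesystem as the target, which is why it is created in the destination directory rather than the system temp dir.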
MOD 10 E02 Person C Publish schema examples for frontend and notebook clients tests/fixtures/generate_api_examples.py, tests/fixtures/api_schema_examples.json 2026-03-08 Added a deterministic generator that builds canonical REST and WebSocket example payloads from real Pydantic models and seeded scenario data, then writes a shared api_schema_examples.json fixture for frontend and notebook consumers. The generated examples now use the current deterministic judge metadata instead of stale stub text. Frontend and notebook can mock against shared sample payloads Yes - verified with python tests/fixtures/generate_api_examples.py and fixture review
MOD 11 E02 Person A Implement StepResult model replicalab/models.py, server/app.py, tests/test_models.py 2026-03-08 Added typed RewardBreakdown and StepInfo models, upgraded StepResult.info to the reserved-key contract while still allowing debug metadata, and updated the stub runtime to build typed reward and step-info payloads explicitly. Step result serializes cleanly and all consumers agree on its shape Yes - verified with python -m pytest tests/test_models.py
MOD 12 E02 Person A Create environment configuration module with shared constants replicalab/config.py, server/app.py, replicalab/scenarios/*.py, tests/test_config.py 2026-03-08 Added a shared configuration module for default scenario and difficulty, round cap, budget cap, timeout values, stub reward, and API host or port defaults, then updated the server and scenario builders to import those constants instead of repeating literals. All modules import config from one place and no magic numbers remain in env or scoring code Yes - verified with python -m pytest tests/test_config.py tests/test_scenarios.py
SCN 01 E03 Person A Implement deterministic RNG helper seed_rng() replicalab/utils/seed.py, replicalab/scenarios/templates.py 2026-03-08 Added deterministic seed helpers that derive reproducible RNG namespaces for scenario generation. Same seed always yields the same random choices and the seed utility is importable from scenarios and env Yes - verified with python -m pytest tests/test_scenarios.py
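One common way to derive reproducible, independent RNG streams per namespace from a single seed is to hash the (seed, namespace) pair. A sketch of the idea only; `seed_rng()`'s actual signature and derivation may differ:

```python
import hashlib
import random


def seed_rng(seed: int, namespace: str) -> random.Random:
    """Derive an independent, reproducible RNG stream for (seed, namespace).

    Hashing keeps streams decorrelated: "scenario" and "difficulty" draws
    from the same base seed do not interleave or collide.
    """
    digest = hashlib.sha256(f"{seed}:{namespace}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))
```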
SCN 02 E03 Person A Define normalized scenario schema with task summary, success criteria, constraints, resources, allowed substitutions, and hidden reference spec replicalab/scenarios/templates.py 2026-03-08 Added NormalizedScenarioPack plus strict ScenarioConstraint, ScenarioResource, AllowedSubstitution, and HiddenReferenceSpec models to standardize all scenario families. All scenario builders return the same normalized top-level structure and mapper-ready inputs Yes - verified with python -m pytest tests/test_scenarios.py
SCN 03 E03 Person A Implement mathematics template replicalab/scenarios/math_reasoning.py 2026-03-08 Added deterministic mathematics planning templates covering theorem, proof-goal, review, and time constraints. Generated scenario passes structure and internal consistency tests Yes - verified with python -m pytest tests/test_scenarios.py
SCN 04 E03 Person A Implement ML benchmark template replicalab/scenarios/ml_benchmark.py 2026-03-08 Added deterministic ML benchmark templates covering dataset, compute, time, and evaluation constraints. Generated scenario passes structure and internal consistency tests Yes - verified with python -m pytest tests/test_scenarios.py
SCN 05 E03 Person A Implement finance and trading planning template replicalab/scenarios/finance_trading.py 2026-03-08 Added deterministic offline finance and trading planning templates covering capital, drawdown, slippage, and backtest rules. Generated scenario passes structure and internal consistency tests Yes - verified with python -m pytest tests/test_scenarios.py
SCN 06 E03 Person A Implement difficulty application for easy, medium, hard replicalab/scenarios/templates.py, tests/test_scenarios.py 2026-03-08 Added mechanical difficulty scaling that adjusts budgets, time, staff, resource availability, and injected conflict constraints across easy, medium, and hard. Difficulty visibly changes the normalized scenario pack in a meaningful way Yes - verified with python -m pytest tests/test_scenarios.py
SCN 07 E03 Person A Implement normalized constraint and resource generator replicalab/scenarios/templates.py, tests/test_scenarios.py 2026-03-08 Added normalized constraint and resource mapping into role-specific observations with consistency checks for unique keys and non-contradictory generated packs. No generated scenario contains contradictory constraints or resources Yes - verified with python -m pytest tests/test_scenarios.py
SCN 08 E03 Person A Implement hidden reference spec and allowed substitutions per template replicalab/scenarios/templates.py, tests/test_scenarios.py 2026-03-08 Added per-template hidden reference specs and allowed substitutions so scoring and negotiation can distinguish fixed versus flexible elements deterministically. Hidden reference clearly marks what is fixed versus flexible for deterministic scoring Yes - verified with python -m pytest tests/test_scenarios.py
SCN 09 E03 Person A Implement generate_scenario(seed, template, difficulty) replicalab/scenarios/templates.py, server/app.py, tests/test_scenarios.py 2026-03-08 Added deterministic full-scenario generation and wired the stub server to use the normalized scenario families instead of the earlier hard-coded lab-only placeholder list. Function returns a full scenario with deterministic content Yes - verified with python -m pytest tests/test_scenarios.py and a _StubEnv.reset(...) smoke test
SCN 10 E03 Person A Add seeded generation tests and consistency tests tests/test_scenarios.py 2026-03-08 Added seeded determinism, variation, difficulty, consistency, and family-list tests for the normalized scenario engine. Same seed plus template returns the same scenario and different seeds vary Yes - verified with python -m pytest tests/test_scenarios.py
SCN 13 E03 Person A Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time slot conflicts and duration replicalab/scenarios/templates.py, replicalab/scenarios/__init__.py, tests/test_scenarios.py 2026-03-08 Added typed ResourceBooking and SchedulingWindow models, extended NormalizedScenarioPack with deterministic booking and scheduling data, wired seeded generation into the scenario builder across all three domains, and added five scenario tests covering determinism, easy-mode no-conflict behavior, JSON round-trip, valid windows, and domain coverage. Constraint generator can produce realistic booking conflicts across domains and the Lab Manager can check availability Yes - verified with python -m pytest tests/test_scenarios.py (13 tests pass) and full suite (304 passed)
AGT 09 E04 Person A Add deterministic feasibility checker tests for Lab Manager grounding tests/test_lab_manager_policy.py 2026-03-08 Added seventeen deterministic regression tests covering check_feasibility(...), suggest_alternative(...), and compose_lab_manager_response(...) across all three domains, including repeated-run determinism, substitution ordering, duration and sample-size revision stability, never-worsens checks, action-type branching, flag mirroring, and explanation stability. Same proposal plus same normalized scenario returns the same checker results every time Yes - verified with python -m pytest tests/test_lab_manager_policy.py (37 tests pass) and full suite (304 passed)
ENV 01 E06 Person A Create ReplicaLabEnv class skeleton replicalab/env/replicalab_env.py, replicalab/env/__init__.py 2026-03-08 Added a real ReplicaLabEnv module as a drop-in replacement for the former in-server stub, ported the working stub behavior into the environment package, wired scenario-pack-backed reset/step/state/close methods with follow-on TODO(ENV XX) markers, and removed the old stub-only marker from StepInfo payloads. Environment class imports and instantiates without runtime errors Yes - verified with a direct ReplicaLabEnv.reset(...) -> step(...) -> state() -> close() smoke run and python -m pytest (111 passed)
JDG 01 E05 Person A Implement rigor or objective-validity score replicalab/scoring/rigor.py, replicalab/utils/text.py, tests/test_reward.py 2026-03-08 Added score_rigor(protocol, scenario) with weighted sub-scores for structural completeness (0.30), success criteria coverage (0.40), and required element coverage (0.30). Uses shared element_tokens from replicalab/utils/text.py. Five focused tests in test_reward.py cover quality ordering, determinism, controls impact, rationale length, and all-domain range validation. Score is between 0 and 1, matches rubric examples, and rewards correct evidence-backed planning Yes - verified with python -m pytest tests/test_reward.py (18 tests pass)
JDG 02 E05 Person A Implement feasibility score replicalab/scoring/feasibility.py, tests/test_reward.py 2026-03-08 Added score_feasibility(protocol, scenario, check=None) that derives a continuous [0,1] signal from FeasibilityCheckResult (AGT 05). Seven dimensions weighted equally (1/7) with partial credit for budget, equipment, reagents, and staff. Accepts optional pre-computed check to avoid redundant work. Six focused tests cover viable protocol, infeasible ordering, pre-computed check equivalence, determinism, partial credit, and all-domain range. Score is between 0 and 1 and matches normalized constraint logic Yes - verified with python -m pytest tests/test_reward.py (18 tests pass)
JDG 03 E05 Person A Implement fidelity score replicalab/scoring/fidelity.py, tests/test_reward.py 2026-03-08 Added score_fidelity(protocol, scenario) with substitution-aware scoring: required element coverage (0.50, direct match=1.0, substitution=0.7), flexible element alignment (0.20, bonus only), target metric alignment (0.20), and technique appropriateness (0.10). Five focused tests cover aligned vs misaligned ordering, determinism, substitution partial credit, target metric impact, and all-domain range. Score is between 0 and 1 and matches rubric examples for plan and evidence alignment Yes - verified with python -m pytest tests/test_reward.py (18 tests pass)
JDG 04 E05 Person A Implement total reward formula replicalab/scoring/rubric.py, tests/test_reward.py 2026-03-07 compute_total_reward(breakdown) implements 10 Γ— rigor Γ— feasibility Γ— fidelity + bonuses βˆ’ penalties with max(0.0, ...) floor clamp. Eight new tests in test_reward.py verify perfect-vs-broken ordering, zero-feasibility behavior, efficiency bonus ordering, exact penalty subtraction, zero-clamp floor, determinism, external penalties injection, and default-empty penalties. Seven existing rubric tests in test_env.py also cover the formula. Total reward formula matches agreed math, clamps at zero, and returns consistent output for plan quality and bounded tool behavior Yes - verified with python -m pytest tests/test_reward.py (26 tests pass) and python -m pytest tests/test_env.py (36 tests pass)
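The agreed formula above reads as 10 × rigor × feasibility × fidelity, plus bonuses, minus penalties, floored at zero. A direct transcription (the project's `compute_total_reward` takes a breakdown object; scalar arguments are used here for clarity):

```python
def compute_total_reward(rigor: float, feasibility: float, fidelity: float,
                         bonuses: float = 0.0, penalties: float = 0.0) -> float:
    """10 * rigor * feasibility * fidelity + bonuses - penalties, clamped at 0."""
    return max(0.0, 10.0 * rigor * feasibility * fidelity + bonuses - penalties)
```

Because the three axis scores multiply, any single zero axis (for example zero feasibility) collapses the base term to zero, which matches the zero-feasibility behavior the tests verify.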
JDG 05 E05 Person A Build reward breakdown object replicalab/scoring/rubric.py, replicalab/scoring/__init__.py, tests/test_reward.py 2026-03-07 build_reward_breakdown(...) accepts an optional penalties: dict[str, float] parameter for named penalty keys (e.g. invalid_tool_use, unsupported_claim) from bounded-tool diagnostics without reopening the model contract. Returns a typed RewardBreakdown with rigor, feasibility, fidelity, efficiency_bonus, communication_bonus, and penalties dict. Exported through replicalab.scoring. Breakdown includes rigor, feasibility, fidelity, bonuses, penalties, and bounded tool diagnostics extension point Yes - verified with python -m pytest tests/test_reward.py (26 tests pass) and python -m pytest tests/test_env.py (36 tests pass)
JDG 06 E05 Person A Add optional plain English explanation function from reward breakdown replicalab/scoring/explain.py, replicalab/scoring/__init__.py, tests/test_reward.py 2026-03-08 Added explain_reward(...), a deterministic explanation builder that mirrors rigor, feasibility, fidelity, bonuses, penalties, and total reward with stable quality-tier labels and without introducing any new scoring logic. Exported through replicalab.scoring and covered by nine focused tests. Explanation mirrors rubric, may reference bounded evidence or tool outcomes, and introduces no new hidden logic Yes - verified with python -m pytest tests/test_reward.py (40 tests pass)
JDG 08 E05 Person A Add score determinism tests and edge case tests tests/test_reward.py 2026-03-08 Added six focused regression tests covering good-vs-awful ordering across all judge axes and total reward, success-criteria sensitivity in rigor scoring, partial equipment credit ordering in feasibility scoring, direct-match vs allowed-substitution vs miss ordering in fidelity scoring, and reward-breakdown determinism with and without a precomputed feasibility check. Perfect and broken protocols produce expected relative ordering and scoring remains deterministic across edge cases Yes - verified with python -m pytest tests/test_reward.py (40 tests pass) and python -m pytest -q (264 passed)
JDG 11 E05 Person A Add structured final audit payload with judge_notes, verdict, and top failure reasons replicalab/agents/judge_policy.py, replicalab/agents/__init__.py, tests/test_judge_policy.py 2026-03-08 Created JudgeAudit model and build_judge_audit() builder that derives verdict (accept/timeout/no_agreement), reuses explain_reward() for judge_notes, and extracts top failure reasons from weak rubric components and penalty keys. Exported through replicalab.agents. Ten tests cover all three verdict paths, component-driven failure reasons, penalty surfacing, reason cap, good-protocol empty reasons, determinism, and JSON round-trip. Final judgement output is deterministic, human readable, and consumable by env, API, logs, and UI Yes - verified with python -m pytest tests/test_judge_policy.py (10 tests pass) and full suite (255 passed)
ENV 02 E06 Person A Implement real reset wiring replicalab/env/replicalab_env.py 2026-03-08 _make_observation() now uses the scenario pack as source of truth for booked/out-of-stock/safety data instead of empty placeholders. Eight reset tests verify both roles populated, booked/out-of-stock preserved, all templates and difficulties. Reset returns initial observations with full scenario data Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
ENV 03 E06 Person A Implement Scientist turn with validation replicalab/env/replicalab_env.py 2026-03-08 Added _validate_scientist_action() that runs validate_protocol() on proposals and returns structured error strings without crashing the env. Invalid actions don't advance the round. Valid action updates state, invalid action returns structured error Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
ENV 04 E06 Person A Implement Lab Manager response step replicalab/env/replicalab_env.py 2026-03-08 _lab_manager_action() uses the full grounded pipeline: check_feasibility() → suggest_alternative() → compose_lab_manager_response(). Lab Manager response is grounded in feasibility check results Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
ENV 05 E06 Person A Centralize termination logic replicalab/env/replicalab_env.py 2026-03-08 Added _check_termination(): Scientist accept with existing protocol OR max_rounds. Lab Manager accept does NOT auto-terminate. Episode terminates on agreement or round limit Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
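The centralized rule above (Scientist accept with an existing protocol, or hitting the round cap; Lab Manager accept never auto-terminates) can be sketched as a pure function. Parameter names are illustrative, not the env's actual signature:

```python
def check_termination(role: str, action_type: str, has_protocol: bool,
                      round_index: int, max_rounds: int) -> bool:
    """Terminate on Scientist accept (with a protocol on the table) or round cap.

    A Lab Manager 'accept' deliberately falls through to the round-cap check.
    """
    if role == "scientist" and action_type == "accept" and has_protocol:
        return True
    return round_index >= max_rounds
```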
ENV 06 E06 Person A Wire real judge scoring replicalab/env/replicalab_env.py, tests/test_env.py 2026-03-07 Terminal accept steps call build_reward_breakdown() and compute_total_reward() with real rigor/feasibility/fidelity scores stored in EpisodeState. Terminal-without-agreement path now distinguishes timeout (max rounds) from no_agreement verdict. Four new tests in TestEnvReward verify agreement-terminal breakdown/notes/verdict, no-agreement determinism, timeout verdict, and state-stored component scores. Final step returns total reward, breakdown info, and deterministic penalties or bonuses; verdict distinguishes timeout from no_agreement Yes - verified with python -m pytest tests/test_env.py (36 tests pass) and python -m pytest (178 tests pass)
ENV 07 E06 Person A Implement state() deep snapshot replicalab/env/replicalab_env.py 2026-03-08 state() now returns self._state.model_copy(deep=True) so callers get an independent snapshot. Two tests verify mutation isolation. State snapshot is independent of env internals Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
ENV 08 E06 Person A Implement close() with lifecycle guard replicalab/env/replicalab_env.py 2026-03-08 Added _closed flag, idempotent close(), _ensure_open() guard on step(), and reset() reopens a closed env. Three tests verify idempotency, step-after-close raises, and reset-reopens. Close frees resources and does not throw; step after close raises Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
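The close-lifecycle contract in this entry (idempotent `close()`, `step()` raising after close, `reset()` reopening) is easy to show with a `_closed` flag. A toy stand-in, not the real `ReplicaLabEnv`:

```python
class LifecycleGuardedEnv:
    """Toy env demonstrating the _closed flag contract described above."""

    def __init__(self) -> None:
        self._closed = False
        self._round = 0

    def _ensure_open(self) -> None:
        if self._closed:
            raise RuntimeError("env is closed; call reset() to reopen")

    def reset(self) -> dict:
        self._closed = False  # reset() reopens a closed env
        self._round = 0
        return {"round": self._round}

    def step(self, action: dict) -> dict:
        self._ensure_open()  # guard: step after close must raise
        self._round += 1
        return {"round": self._round}

    def close(self) -> None:
        self._closed = True  # idempotent: safe to call repeatedly
```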
ENV 10 E06 Person A Add reset, step, invalid action, timeout, and deterministic replay tests tests/test_env.py 2026-03-08 Added a dedicated replay-determinism regression block that verifies same seed plus same actions yields the same initial observation, step trajectory, timeout terminal path, invalid-action behavior, and audit payload across math, ML, and finance families. The new coverage keeps replay deterministic without depending on file-backed logging. Tests pass for seeded reset, valid step, invalid step, timeout, and replay consistency across supported scenario families Yes - verified with python -m pytest tests/test_env.py (56 tests pass) and full suite (327 passed)
ENV 11 E06 Person A Attach judge audit payload to final StepResult, terminal observations, and replay state replicalab/models.py, replicalab/env/replicalab_env.py, server/app.py, tests/test_env.py, tests/test_server.py 2026-03-08 Added top_failure_reasons to StepInfo, EpisodeState, and EpisodeLog; terminal env steps now build a canonical audit via build_judge_audit(...); and replay log construction now persists top_failure_reasons from terminal StepResult.info instead of dropping them. Seven env tests cover terminal audit behavior and a replay test verifies the audit reasons survive into GET /replay/{episode_id} payloads. Completed episodes expose audit notes alongside reward breakdown in a stable schema across env state and replay Yes - verified with python -m pytest tests/test_env.py (43 tests pass), python -m pytest tests/test_server.py (37 tests pass), and full suite (314 passed)
OBS 04 E10 Person A Add deterministic replay test using seed and action sequence tests/test_env.py 2026-03-08 Closed the observability-side replay guard by reusing the new seeded replay-determinism suite in TestReplayDeterminism, which verifies same-seed same-action trajectories, timeout replay determinism, invalid-action replay determinism, and stable terminal audit payloads across all three scenario families. Replay of the same seed and action sequence matches the prior state sequence deterministically Yes - verified with python -m pytest tests/test_env.py (56 tests pass) and full suite (327 passed)
TST 01 E11 Person A Add reset returns valid observations test tests/test_env.py 2026-03-08 Eight tests in TestReset class covering both roles populated, scientist fields, lab manager fields, booked/out-of-stock preservation, state round zero, episode ID, clearing previous episode, and all templates/difficulties. Test confirms both roles receive valid structured observations Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
TST 02 E11 Person A Add valid action step test tests/test_env.py 2026-03-08 Eight tests in TestStep class covering round advancement, observation shape, conversation history, accept termination, real reward scores, max round termination, step info fields, and full propose-then-accept episode. Valid action advances round and returns correct shape Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
| TST 03 | E11 | Person A | Add invalid action handling test | tests/test_env.py | 2026-03-08 | Four tests in TestInvalidAction class covering error string on invalid duration, env survival after error, no round advancement on invalid action, and request_info always passes. | Invalid action yields structured error and env survives | Yes - verified with python -m pytest tests/test_env.py (32 tests pass) |
| TST 04 | E11 | Person A | Add perfect protocol high reward test | tests/test_reward.py | 2026-03-08 | Added reward-regression coverage proving a fully aligned protocol scores higher than a broken baseline and stays ordered consistently across reruns. | Perfect protocol scores higher than baseline and broken protocol | Yes - verified with python -m pytest tests/test_reward.py (26 tests pass) |
| TST 05 | E11 | Person A | Add zero dimension or penalty behavior test | tests/test_reward.py | 2026-03-08 | Added reward-regression coverage for zero-feasibility collapse, exact penalty subtraction, and zero-floor clamp behavior so timeout and penalty paths lower reward deterministically. | Zero feasibility or timeout lowers reward as expected | Yes - verified with python -m pytest tests/test_reward.py (26 tests pass) |
| MOD 08 | E02 | Person A | Write unit tests for schemas and validators | tests/test_mod08_schemas.py | 2026-03-08 | Created 70 comprehensive unit tests covering all Pydantic model edge cases: ScientistAction (15 tests for all action types, mixed-mode rejection, whitespace stripping, empty/negative field rejection), LabManagerAction (11 tests for all action types, feasible-flag consistency, suggestion-field rules), Protocol (10 tests for boundary values, stripping, extra-field rejection), ConversationEntry (7 tests for null/empty action_type, role validation), RewardBreakdown (9 tests for boundary values, range rejection), Observation (4 tests for both-none, single-role), LabManagerObservation (3 tests for negative fields, stripping), StepInfo (3 tests for extra-field allowance), StepResult (3 tests), EpisodeState (2 tests), EpisodeLog (3 tests for failure reasons, model_dump keys). | Tests cover valid parse, invalid parse, and replay serialization | Yes - verified with python -m pytest tests/test_mod08_schemas.py -v (70 passed) and full suite (409 passed) |
| API 03 | E07 | Person C | Add POST /step endpoint | server/app.py, tests/test_server.py | 2026-03-07 | Fixed _build_episode_log() to take the real StepResult instead of rebuilding reward data from state with stale stub values. Both the REST /step and WebSocket step handlers now pass the terminal StepResult to the updated helper so replay logs use real reward_breakdown, judge_notes, and verdict (including timeout vs no_agreement). Added five endpoint tests covering reset-then-step happy path, invalid session ID 404, terminal step with real reward breakdown, semantic invalid action returning 200 with info.error, and replay with real judge data. | Step endpoint accepts valid action and returns step result | Yes - verified with python -m pytest tests/test_server.py (10 tests pass) and python -m pytest (183 tests pass) |
| API 06 | E07 | Person C | Add WebSocket session handler with isolated env per connection | server/app.py, tests/test_server.py | 2026-03-07 | WebSocket handler at /ws supports reset, step, and ping message types with per-connection env isolation, idle timeout, and replay storage on terminal episodes. Twelve WebSocket tests cover ping-pong, reset observation, step result, full-episode real reward, invalid JSON, missing action field, invalid action payload, unknown message type, session isolation, semantic invalid action returning step_ok with info.error, timeout verdict proving real-env integration, and terminal episode replay persistence via GET /replay/{episode_id}. | WebSocket session handler supports reset, step, ping with isolated env per connection and correct replay storage | Yes - verified with python -m pytest tests/test_server.py (22 tests pass) and python -m pytest (195 tests pass) |
| TST 07 | E11 | Person C | Add WebSocket session handler tests | tests/test_server.py | 2026-03-07 | Twelve focused WebSocket tests covering connectivity, message handling, error paths, session isolation, semantic-vs-transport error distinction, timeout verdict, and replay log persistence with real judge data. Tests verify that structurally valid but semantically invalid actions return step_ok with info.error (not WS error frames), matching the env contract. | WebSocket tests cover happy path, error handling, session isolation, and real-env integration | Yes - verified with python -m pytest tests/test_server.py (22 tests pass) |
| API 02 | E07 | Person C | Add POST /reset endpoint | server/app.py, tests/test_server.py | 2026-03-08 | The /reset endpoint creates a new env (or closes the prior one when reusing session_id), calls env.reset(...), persists env, last_active, and episode_id in the in-memory REST session store, and returns session_id, episode_id, observation. Seven dedicated tests cover response shape, both-role observation, explicit session_id reuse, prior-env close on reuse, default params, all scenario/difficulty combos, and seed determinism. | Reset endpoint starts a new episode and returns initial observation | Yes - verified with python -m pytest tests/test_server.py (29 tests pass) and python -m pytest (202 tests pass) |
| API 04 | E07 | Person C | Add GET /scenarios endpoint | server/app.py, tests/test_server.py | 2026-03-08 | GET /scenarios returns the available_scenario_families() output through the typed ScenariosResponse model. Five focused tests cover status code, response shape, all three scenario families, the expected easy, medium, and hard difficulties, and the absence of extra keys. | Endpoint lists available scenario families and difficulties | Yes - verified with python -m pytest tests/test_server.py -v (34 tests pass) |
| API 07 | E07 | Person C | Add idle timeout and graceful disconnect cleanup | server/app.py, tests/test_server.py | 2026-03-08 | Verified the existing WebSocket idle-timeout and disconnect cleanup path with two focused tests: one monkeypatches the idle timeout to 0.5s and confirms the server closes with code 1000 when no message arrives, and one wraps _make_env() to confirm env.close() is called exactly once from the finally block on disconnect. | Stale connections close cleanly and the environment closes without leak | Yes - verified with python -m pytest tests/test_server.py -v (34 tests pass) |
| API 13 | E07 | Person C | Add CORS middleware configuration for frontend origins in dev and production | server/app.py, tests/test_server.py | 2026-03-08 | Confirmed the existing FastAPI CORS middleware allows the local Vite frontend origin plus https://*.hf.space, and added three explicit preflight tests covering localhost allowance, HF Space allowance, and disallowed-origin rejection. | Frontend on localhost:5173 and HF Space origin can reach the API without CORS errors | Yes - verified with python -m pytest tests/test_server.py -v (34 tests pass) |
| API 08 | E07 | Person C | Build Dockerfile with Python app startup on port 7860 | server/Dockerfile, Dockerfile, server/requirements.txt, docs/max/deployment.md | 2026-03-08 | Fixed the editable install (-e . → . --no-deps) in both server/Dockerfile and the root Dockerfile, added httpx and websocket-client to server/requirements.txt (required by replicalab.client), and rebuilt without cache. Verified the Docker container starts with the real env ("env":"real") and all four endpoints work: GET /health, GET /scenarios, POST /reset, POST /step. Added verified endpoint commands to docs/max/deployment.md. | Local Docker run serves app on port 7860 | Yes - verified with docker build -f server/Dockerfile -t replicalab . && docker run -p 7860:7860 replicalab and curl against all four endpoints |
| API 09 | E07 | Person C | Add Hugging Face Space metadata and deploy instructions | README.md, Dockerfile, docs/max/deployment.md | 2026-03-08 | Added the Hugging Face Spaces YAML frontmatter to the root README, created the root-level Dockerfile required by the Docker SDK, and documented Space creation, git remote setup, push, logs, and secret management in docs/max/deployment.md. | Space config is valid for Docker app deployment | Yes - verified against HF Spaces Docker deployment requirements |
| API 15 | E07 | Person C | Create HF Space README.md with YAML frontmatter | README.md | 2026-03-08 | Added the required Spaces frontmatter fields (sdk: docker, app_port: 7860, title, emoji, colors, pinned) to the root README so Hugging Face parses the Space metadata correctly on push. | HF Space config is valid and Space launches correctly from the metadata | Yes - verified against the HF Spaces frontmatter schema |
| API 14 | E07 | Person C | Add REST session management so each user gets isolated environment state | tests/test_api_rest_isolation.py | 2026-03-08 | Created 11 dedicated REST session isolation tests in a standalone file covering: two resets produce different sessions, independent observations across scenarios, stepping one session does not mutate the other, independent round counts, terminal isolation, session_id reuse creates a new episode and resets rounds, reuse does not affect other sessions, 404 on nonexistent session, step-after-terminal behavior, and replay isolation between sessions. No server changes needed; isolation already works correctly. | Two concurrent REST users do not share or corrupt each other's episode state | Yes - verified with python -m pytest tests/test_api_rest_isolation.py (11 tests pass) and full suite (307 passed) |
| API 10 | E07 | Person C | Deploy live Space and verify health, reset, and step | docs/max/deployment.md, README.md | 2026-03-08 | Verified the live HF Space at https://ayushozha-replicalab.hf.space with all four endpoints: GET /health (200, env=real), GET /scenarios (200, 3 families), POST /reset (200, returns session_id/episode_id/observation), POST /step (200, returns reward/done/info). Ran a full episode (propose → accept) with real judge scoring: rigor=0.465, feasibility=1.000, fidelity=0.325, total_reward=2.313, verdict=accept. Updated deployment docs and README with the verified live URL. | Live Space responds successfully and one end-to-end episode works on the hosted env | Yes - verified with httpx requests against https://ayushozha-replicalab.hf.space |
| API 17 | E07 | Person C | Document secrets and API key management for hosted deployment and Colab | docs/max/deployment.md | 2026-03-08 | Documented that the server is fully self-contained with no external API calls or secrets required. Added a secrets reference table for all four contexts (HF Space, local dev, Docker, Colab notebook) with HF_TOKEN for model downloads and REPLICALAB_URL for the hosted env. Documented Colab Secrets panel setup. Added a future-secrets section for an optional hosted evaluator. | Secrets setup is documented clearly enough for another teammate to reproduce | Yes - verified by inspecting server/app.py for env var references (none found) and documenting the complete secrets landscape |
| JDG 07 | E05 | Person C | Log reward breakdown to CSV or JSONL per episode | replicalab/utils/logging.py, tests/test_logging.py | 2026-03-08 | Verified the existing implementation: append_reward_csv() writes per-episode rows with all V2 columns (parsimony, bonuses, penalty total, verdict), append_reward_jsonl() preserves nested penalty dicts and bounded-tool metrics, and log_episode_reward() writes to both formats. 22 tests in tests/test_logging.py cover CSV creation, header dedup, JSONL records, default breakdowns, nested penalty preservation, determinism, and the convenience wrapper. No code changes needed. | Reward file contains seed, scenario, score components, total reward, rounds, agreement, and bounded tool metrics | Yes - verified with python -m pytest tests/test_logging.py -v (22 passed) and full suite (409 passed) |
| API 01 | E07 | Person C | Create FastAPI app shell and health endpoint | server/app.py | 2026-03-08 | Verified the FastAPI app shell is fully functional: GET /health returns 200 with {"status":"ok","env":"real"}, the app imports and wires ReplicaLabEnv, logging is configured via env vars, CORS middleware is active, and all downstream endpoints (reset, step, scenarios, replay, WebSocket) are operational. Server endpoint tests in tests/test_server.py (34 tests) and REST isolation tests (11 tests) confirm full coverage. No code changes needed; the task was already complete beyond its acceptance criteria. | GET /health returns 200 with simple payload | Yes - verified with existing tests and full suite (409 passed) |
| OBS 02 | E10 | Person C | Add local log levels and readable console formatting | replicalab/config.py, server/app.py | 2026-03-08 | Verified logging already meets all acceptance criteria: the REPLICALAB_LOG_LEVEL env var toggles log verbosity without code edits (default INFO), LOG_FORMAT provides a readable %(asctime)s [%(levelname)s] %(name)s: %(message)s layout, and server/app.py wires both via logging.basicConfig(). No code changes needed. | Debug logs can be toggled without code edits | Yes - verified by reading replicalab/config.py (lines 30-31) and server/app.py (lines 75-79) |
| ENV 09 | E06 | Person C | Write episode logs on completion | server/app.py | 2026-03-08 | Added write_episode_log() and log_episode_reward() calls to server/app.py in both the REST /step and WebSocket step handlers. Terminal episodes now auto-persist replay JSON and reward CSV/JSONL to disk. | Completed episodes generate replayable logs automatically | Yes - verified with terminal episode persistence through REST and WebSocket paths |
| OBS 09 | E10 | Person C | Extend episode summary with audit metadata | replicalab/models.py | 2026-03-08 | Added invalid_action_count (int) and invalid_action_rate (float) fields to EpisodeLog in replicalab/models.py. The server tracks invalid actions per session and per WebSocket connection. | Every completed episode log contains the audit payload plus demo and evaluation metrics | Yes - verified with model field presence and server-side tracking |
| OBS 07 | E10 | Person C | Script to run one episode and dump logs | scripts/run_episode.py | 2026-03-08 | Created scripts/run_episode.py that resets the env, runs a baseline propose-then-accept episode, and writes replay JSON plus reward CSV/JSONL. | One command produces a complete local sample log | Yes - verified with script execution producing replay and reward files |
| TST 11 | E11 | Person C | Judge audit payload contract tests | tests/test_audit_contract.py | 2026-03-08 | Created tests/test_audit_contract.py with 17 tests across 3 classes: StepInfoAuditContract (6 tests), EpisodeLogAuditContract (6 tests), AuditModelContracts (5 tests). | Tests confirm terminal payloads and replay files expose audit notes, agreement, and invalid action metrics | Yes - verified with python -m pytest tests/test_audit_contract.py (17 tests pass) |
| API 05 | E07 | Person C | Add GET /replay/{episode_id} endpoint | server/app.py | 2026-03-08 | Already implemented at server/app.py lines 536-540. The endpoint returns the completed episode log JSON for a valid episode ID. | Endpoint returns completed log for valid episode id | Yes - verified with existing replay endpoint tests |
| API 11 | E07 | Person C | Add server endpoint tests and WebSocket smoke test | tests/test_server.py | 2026-03-08 | Already implemented in tests/test_server.py with 44 tests covering health, reset, step, scenarios, replay, WebSocket connectivity, error handling, session isolation, and smoke paths. | Local server tests pass for health, reset, step, invalid payload, and ws connect | Yes - verified with python -m pytest tests/test_server.py (44 tests pass) |
| API 18 | E07 | Person C | Include judge audit payload in terminal responses | server/app.py | 2026-03-08 | Already implemented. Terminal StepInfo includes judge_notes, verdict, and top_failure_reasons from the real judge audit in both REST and WebSocket paths. | Clients receive judge_notes, verdict fields, and bounded tool audit data without separate log file access | Yes - verified with terminal response inspection and audit contract tests |
| OBS 01 | E10 | Person C | Standardize episode log schema | replicalab/models.py | 2026-03-08 | Already implemented. The EpisodeLog model in replicalab/models.py is the canonical schema with all required fields for transcript, state snapshots, scores, and audit metadata. | Every completed episode log contains the same required fields | Yes - verified with EpisodeLog model inspection and schema tests |
| OBS 03 | E10 | Person C | Episode id generation and file naming conventions | replicalab/utils/logging.py | 2026-03-08 | Already implemented. UUID generation in the env, {episode_id}.json naming in replicalab/utils/logging.py. Logs never overwrite because each episode gets a unique UUID. | Logs never overwrite and are easy to locate | Yes - verified with replay file naming behavior |
| TST 06 | E11 | Person C | Health plus reset plus step endpoint tests | tests/test_server.py | 2026-03-08 | Already implemented in tests/test_server.py with TestHealthEndpoint, TestResetEndpoint, and TestStepEndpoint classes. | API tests pass locally | Yes - verified with python -m pytest tests/test_server.py |
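The API 02 and API 14 rows above describe the REST session model: each session_id owns an isolated environment, reusing a session_id closes the prior env, and stepping one session never touches another. A minimal self-contained sketch of that pattern, using a stub env and invented names (the real logic lives in server/app.py and ReplicaLabEnv; nothing here is the actual implementation):

```python
import uuid

class StubEnv:
    """Stand-in for ReplicaLabEnv, reduced to the lifecycle the server needs."""
    def __init__(self):
        self.closed = False
        self.rounds = 0
    def reset(self):
        self.rounds = 0
        return {"observation": "initial"}
    def step(self, action):
        self.rounds += 1
        return {"rounds": self.rounds}
    def close(self):
        self.closed = True

class SessionStore:
    """In-memory REST session store: one isolated env per session_id."""
    def __init__(self):
        self._sessions = {}
    def reset(self, session_id=None):
        # Reusing a session_id closes the prior env before creating a new one.
        if session_id in self._sessions:
            self._sessions[session_id].close()
        session_id = session_id or uuid.uuid4().hex
        env = StubEnv()
        obs = env.reset()
        self._sessions[session_id] = env
        return session_id, obs
    def step(self, session_id, action):
        if session_id not in self._sessions:
            raise KeyError(session_id)  # the real server maps this to a 404
        return self._sessions[session_id].step(action)

store = SessionStore()
a, _ = store.reset()
b, _ = store.reset()
store.step(a, {"action_type": "propose"})  # advances session a only
```

Stepping session `a` leaves session `b` at zero rounds, which is exactly the isolation property the TST tests in tests/test_api_rest_isolation.py assert against the real server.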

Person B (Ayush) - Completed own tasks

ID Epic Task File/Module Date What Was Done Acceptance Criteria Verified
MOD 09 E02 Add output parser that maps model text to ScientistAction replicalab/agents/scientist_policy.py, replicalab/agents/__init__.py, tests/test_scientist_policy.py 2026-03-08 Added a raw-text parser that extracts JSON from plain output, fenced blocks, or prose-wrapped objects, validates it into ScientistAction, and raises explicit ScientistOutputParseError values for missing JSON, invalid JSON, or schema failures. Parser returns structured action or explicit parse error Yes - verified with python -m pytest tests/test_scientist_policy.py tests/test_models.py and a direct parse_scientist_output(...) smoke check
SCN 11 E03 Create hand checked golden scenarios for prompt testing tests/fixtures/golden_scenarios.json, tests/test_scenarios.py 2026-03-08 Added three deterministic golden scenarios for math, ML, and finance prompt checks plus fixture-validation tests. Three fixed scenarios are available for deterministic manual testing Yes - verified with python -m pytest tests/test_scenarios.py
AGT 01 E04 Draft domain-neutral system prompt for Scientist role from normalized scenario data replicalab/agents/scientist_policy.py, tests/test_scientist_policy.py 2026-03-08 Added build_scientist_system_prompt(...) to render role guidance, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON contract from normalized scenario data. Prompt clearly explains role, mapped constraints, and JSON output contract Yes - verified with python -m pytest tests/test_scientist_policy.py and a direct prompt-build smoke check
AGT 02 E04 Build observation to prompt formatting helper from normalized scenario-derived observations replicalab/agents/scientist_policy.py, replicalab/agents/__init__.py, tests/test_scientist_policy.py 2026-03-08 Added format_scientist_observation(...) to render round status, paper context, conversation history, current protocol, and the next-action instruction in a fixed deterministic order, and exported it through the agent package. Formatted prompt includes task info, history, and action schema consistently Yes - verified with python -m pytest tests/test_scientist_policy.py
AGT 04 E04 Build baseline heuristic Scientist for non trained smoke tests replicalab/agents/scientist_policy.py, replicalab/agents/__init__.py, tests/test_scientist_policy.py 2026-03-08 Added build_baseline_scientist_action(...), a deterministic baseline Scientist policy that proposes a protocol on the first turn, revises only when the latest Lab Manager feedback contains an obvious blocker, and otherwise accepts the current protocol so smoke episodes can finish cleanly. Baseline can complete episodes without crashing Yes - verified with python -m pytest tests/test_scientist_policy.py including a stub-env episode smoke test
AGT 05 E04 Implement deterministic feasibility checker over normalized constraints and resources replicalab/agents/lab_manager_policy.py, replicalab/agents/__init__.py, tests/test_lab_manager_policy.py 2026-03-08 Added a deterministic Lab Manager feasibility checker with a typed FeasibilityCheckResult, explicit per-dimension protocol, budget, equipment, reagents, schedule, staff, and policy checks, substitution reporting, and stable summary output. Checker returns clear pass or fail per constraint dimension Yes - verified with python -m pytest tests/test_lab_manager_policy.py tests/test_validation.py tests/test_scientist_policy.py
AGT 06 E04 Implement alternative suggestion logic from allowed substitutions and resource tradeoffs replicalab/agents/lab_manager_policy.py, replicalab/agents/__init__.py, tests/test_lab_manager_policy.py 2026-03-08 Added deterministic alternative-suggestion logic that applies substitutions, duration clamps, and sample-size reductions in fixed order, re-runs feasibility after the revision, and returns a typed AlternativeSuggestion with applied changes, remaining failures, and pre or post feasibility checks. Lab Manager can suggest at least one sensible revision when the initial plan fails Yes - verified with python -m pytest tests/test_lab_manager_policy.py
AGT 07 E04 Add grounded Lab Manager response synthesis from feasibility results and suggested revisions replicalab/agents/lab_manager_policy.py, replicalab/agents/__init__.py, server/app.py, tests/test_lab_manager_policy.py 2026-03-08 Added compose_lab_manager_response(...), a deterministic outward-action composer that converts feasibility plus alternative-suggestion results into a typed LabManagerAction with stable flags, readable explanations, and optional injected explanation rendering, then wired the stub server to log those grounded responses instead of placeholder text. Output is readable, grounded in checker results, and maps cleanly to underlying checks Yes - verified with python -m pytest tests/test_lab_manager_policy.py and a stub-env step smoke check
AGT 11 E04 Select and document base model for Scientist training docs/agt11_scientist_model_selection.md, README.md, docs/training_goals.md 2026-03-08 Updated the active model decision to use Qwen/Qwen3.5-9B as the shared Scientist and Lab Manager base for Northflank H100 runs, with Qwen/Qwen3.5-4B as fallback and Qwen/Qwen3.5-122B-A10B documented as an audit-only judge candidate. Decision is recorded and all team members know which model family is being fine tuned Yes - verified by the decision record, training-goals doc, and README update
AGT 10 E04 Write prompt text files for all three roles with bounded tool rules replicalab/prompts/__init__.py, replicalab/prompts/scientist.txt, replicalab/prompts/lab_manager.txt, replicalab/prompts/judge.txt, tests/test_prompts.py 2026-03-08 Added loadable prompt templates and rendering helpers for Scientist, Lab Manager, and Judge, each grounded in normalized scenario data and explicit bounded-tool rules for search_evidence, run_code_check, and inspect_image. Six prompt tests verify loadability, placeholder rendering, domain neutrality, and role-specific bounded-tool guidance. Prompt files exist, are loadable, encode bounded tool rules clearly, and assemble correctly from normalized scenario data and agreed role behavior Yes - verified with python -m pytest tests/test_prompts.py (6 tests pass) and full suite (304 passed)
AGT 03 E04 Add parse plus retry strategy for malformed model output replicalab/agents/scientist_policy.py, tests/test_scientist_policy.py 2026-03-07 Added call_scientist_with_retry(...) with error-specific correction prompts, bounded retry loop, and exposed RetryMetadata telemetry. Seven focused tests cover first-try success, malformed-then-valid, invalid-then-valid, exhaustion, correction message content, and metadata serialization. Malformed output triggers at least one controlled retry or explicit failure Yes - verified with python -m pytest tests/test_scientist_policy.py (7 retry tests pass)
AGT 08 E04 Add prompt formatting, parse, and bounded-tool policy tests for Scientist policy replicalab/agents/scientist_policy.py, tests/test_scientist_policy.py 2026-03-07 Added bounded-tool policy block to build_scientist_system_prompt(...) naming search_evidence, run_code_check, and inspect_image with explicit rules. Added 24 new tests covering parser happy paths (propose, accept, prose-wrapped), parser edge cases (empty, whitespace, list, extra keys, to_dict()), system prompt across all 3 domains plus dict coercion, bounded-tool policy assertions across all domains, role-boundary and output-contract assertions, formatter edge cases (final round, empty-list protocol), and baseline domain inference and forced-accept behavior. Tests cover happy path, malformed output handling, and stable tool-policy reminders Yes - verified with python -m pytest tests/test_scientist_policy.py (46 tests pass) and python -m pytest tests/ (111 tests pass)
TRN 13 E08 Create reusable environment client module replicalab/client.py, tests/test_client.py 2026-03-08 Added ReplicaLabClient with dual transport support (REST via httpx, WebSocket via websocket-client), unified sync interface (connect, reset, step, state, close), context manager support, internal session ID tracking, typed returns mapped to Pydantic models, and constructor-level transport selection. Twenty-four tests cover both transports: connect, reset, step, full episode, replay, context manager, error paths, semantic invalid action handling, and constructor validation. Client module can be imported by notebook and other consumers without duplicating connection logic Yes - verified with python -m pytest tests/test_client.py (24 tests pass) and python -m pytest (231 tests pass)
TRN 03 E08 Implement env client wrapper for training rollouts replicalab/training/rollout.py, replicalab/training/__init__.py, tests/test_rollout.py 2026-03-08 Added RolloutWorker that wraps ReplicaLabClient to run full episodes via a user-supplied PolicyFn callback, collects typed StepRecord trajectories with observations, actions, and errors, and surfaces terminal EpisodeRecord with total_reward, reward_breakdown, judge_notes, verdict, and agreement_reached. Twelve tests cover baseline rollout completion, reward/breakdown/judge output, determinism, all 3 scenario families, metadata capture, max_steps safety cap, and validation error surfacing. One local episode can be run start-to-finish through the wrapper with no duplicated HTTP/WS code Yes - verified with python -m pytest tests/test_rollout.py (12 tests pass) and python -m pytest (264 tests pass)
TRN 04 E08 Implement rollout collection loop for Scientist episodes replicalab/training/rollout.py, replicalab/training/__init__.py, tests/test_rollout.py, tests/test_rollout_traces.py 2026-03-08 Extended the rollout worker to collect full trajectory records with terminal StepInfo, bounded tool traces, and batched rollout support via collect_rollouts(...). Added trace-focused tests that verify tool-trace capture from StepInfo extras and one-record-per-seed batch collection. Loop collects trajectories, rewards, done signals, and bounded tool traces from frozen evidence packs Yes - verified with python -m pytest tests/test_rollout.py tests/test_rollout_traces.py (14 tests pass) and full suite (304 passed)
TRN 01 E08 Create notebook skeleton notebooks/train_colab.ipynb 2026-03-08 Added a judged-path training notebook with explicit setup, evidence preview, Scientist plan preview, Lab Manager plan preview, gated real-training cell, baseline evaluation cell, and Northflank runtime notes so the flow is readable without hiding logic in notebook-only cells. Notebook has clear runnable sections in the right order and documents the bounded-tool policy Yes - verified with notebook JSON load, preview-plan execution, and python -m pytest tests/test_training_cli.py
TRN 02 E08 Add package install and model setup cell notebooks/train_colab.ipynb, replicalab/training/runtime.py, pyproject.toml 2026-03-08 Added a fresh-runtime install cell that installs the repo plus unsloth, unsloth_zoo, trl, vllm, datasets, and matplotlib, then added runtime helpers and the replicalab-train entrypoint so the same model-loading path works in notebooks and Northflank jobs. Notebook installs dependencies without manual edits beyond secrets Yes - verified with notebook inspection and python -m pytest tests/test_training_cli.py
TRN 14 E08 Select and document base model (notebook side) docs/agt11_scientist_model_selection.md, README.md, notebooks/train_colab.ipynb, docs/training_goals.md 2026-03-08 Updated the active model decision to Qwen/Qwen3.5-9B as the primary shared base for Scientist GRPO and Lab Manager SFT on Northflank H100, kept Qwen/Qwen3.5-4B as the reduced-scale fallback, and documented Qwen/Qwen3.5-122B-A10B as an audit-only judge candidate. Base model choice is documented and all team members know which model family is being trained Yes - verified by the decision record and README update; notebook defaults remain the smaller sponsor-facing path where appropriate
JDG 10 E05 Expose component metrics for training plots replicalab/training/metrics.py, replicalab/training/plots.py, replicalab/training/cli.py, tests/test_training_metrics.py, docs/training_goals.md 2026-03-08 Extended the evaluation and metrics layer to expose average rigor, feasibility, fidelity, parsimony, tool-trace volume, invalid bounded-tool rate, paper understanding, and communication quality, then wired those metrics into saved before-vs-after plots plus shared cross-run benchmark history plots. Notebook and CLI can read the core quality metrics over time, including paper understanding and communication Yes - verified with python -m pytest tests/test_training_metrics.py tests/test_training_cli.py and generated plot artifacts
TRN 05 E08 Connect rollouts to GRPO or equivalent trainer replicalab/training/art_openenv.py, replicalab/training/cli.py, tests/test_training_cli.py, replicalab/outputs/art-training/ 2026-03-08 Added the ART/OpenEnv Scientist training path, converting live ReplicaLab episodes plus frozen evidence packs into ART trajectory groups and executing successful live training updates against the hosted environment. At least one short training run completes without runtime errors while preserving deterministic reward and frozen evidence inputs Yes - verified with live art-scientist-train runs including art-scientist-smoke-20260308 and art-scientist-live-20260308-main
TRN 06 E08 Log episode reward, rigor, feasibility, fidelity, rounds used, and bounded tool metrics replicalab/training/metrics.py, replicalab/training/art_openenv.py, replicalab/training/cli.py 2026-03-08 Added structured episode metric exports covering reward, component scores, rounds used, agreement, parse errors, invalid actions, and invalid bounded-tool rates to JSONL and summary artifacts. Notebook stores a metrics frame across training episodes including bounded tool metrics Yes - verified with reports/metrics.jsonl outputs from ART training and comparison runs
TRN 07 E08 Plot reward curve and component curves with matplotlib replicalab/training/plots.py, replicalab/training/cli.py, replicalab/outputs/art-training/ 2026-03-08 Added saved matplotlib plotting for training-history curves, per-step ART reward-component plots, and comparison bar charts for reward, agreement, invalid actions, and invalid bounded-tool rate. Plotted image shows visible metrics and can be saved to file Yes - verified with saved images including art_reward_components.png and the compare_*.png outputs
TRN 08 E08 Add before versus after evaluation on fixed seeds and frozen evidence packs replicalab/training/evaluation.py, replicalab/training/cli.py, replicalab/agents/scientist_policy.py 2026-03-08 Added policy-comparison evaluation on fixed seeds and frozen evidence packs, then exercised it against the deterministic baseline and trained ART Scientist checkpoints. Notebook compares baseline and trained policy on the same scenarios and evidence packs Yes - verified with scientist-compare-eval runs including art-scientist-compare-smoke-20260308 and art-scientist-compare-20260308-step5
| TRN 09 | E08 | Add policy loading path for trained adapter or checkpoint | replicalab/agents/scientist_policy.py, replicalab/agents/__init__.py, tests/test_scientist_policy.py | 2026-03-08 | Added remote trained-policy loading for ART checkpoints, including evidence-pack-aware prompt assembly and parser-driven retry, so evaluation can switch cleanly between baseline and trained Scientist policies. | Evaluation can switch between baseline and trained model cleanly | Yes - verified with live scientist-compare-eval runs against explicit ART checkpoint steps |
| TRN 10 | E08 | Export plot image and sample logs to outputs/plots | replicalab/training/cli.py, replicalab/outputs/art-training/, replicalab/outputs/training/ | 2026-03-08 | Wired the CLI to save training plots, comparison plots, metrics JSONL, summaries, manifests, and run metadata into stable output directories for README and demo reuse. | Plots are saved and versioned for README use | Yes - verified with generated plot and report artifacts under replicalab/outputs/art-training/ and replicalab/outputs/training/ |
| TRN 15 | E08 | Add agreement rate, invalid action rate, and invalid bounded-tool rate aggregation to evaluation outputs | replicalab/training/metrics.py, replicalab/training/evaluation.py, replicalab/training/cli.py, tests/test_training_metrics.py | 2026-03-08 | Added aggregate agreement, invalid-action, and invalid bounded-tool metrics across evaluation cases, surfaced them in summaries, and plotted them for before-vs-after comparisons. | Notebook reports reward, rounds, agreement rate, invalid action rate, and invalid bounded-tool rate for baseline and trained runs | Yes - verified with comparison summaries and plots from the ART evaluation runs |
| OBS 06 | E10 | Log training run metadata including model, seed, scenario set, steps, evidence-pack version, and bounded-tool policy | replicalab/training/cli.py, replicalab/outputs/art-training/*/reports/run_metadata.json | 2026-03-08 | Added reproducibility metadata exports for every training and evaluation command, including base model, scenario set, checkpoint step, evidence-pack version, and bounded-tool policy. | Notebook exports metadata with each run for reproducibility, including evidence-pack version and bounded-tool policy | Yes - verified with generated run_metadata.json files in training and comparison smoke runs |
| TST 09 | E11 | Create notebook smoke test for fresh runtime | docs/ayush/notebook_smoke_test.md, replicalab/outputs/training/, replicalab/outputs/art-training/ | 2026-03-08 | Wrote the fresh-runtime smoke checklist, then executed the preview, live ART training, and comparison-eval commands end to end against frozen evidence packs and the hosted ReplicaLab environment. | Training notebook runs from top with minimal edits and the bounded-tool path works against frozen evidence packs | Yes - verified with scientist-preview-smoke-20260308b, lab-manager-preview-smoke-20260308b, art-scientist-smoke-20260308b, and art-scientist-compare-smoke-20260308b |
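The TRN 15 row above describes aggregating agreement and invalid-rate metrics across evaluation cases. A minimal sketch of that kind of aggregation is shown below; the record fields (`agreed`, `invalid_actions`, `invalid_bounded_tool`, `total_actions`) and the function name are illustrative assumptions, not the actual `replicalab/training/metrics.py` API.

```python
# Illustrative sketch only: field and function names are assumptions,
# not the actual replicalab/training/metrics.py interface.
from dataclasses import dataclass


@dataclass
class CaseResult:
    agreed: bool               # Scientist and Lab Manager reached agreement
    invalid_actions: int       # actions rejected by the schema validator
    invalid_bounded_tool: int  # bounded-tool calls outside the allowed policy
    total_actions: int         # all actions taken in this evaluation case


def aggregate(cases: list[CaseResult]) -> dict[str, float]:
    """Aggregate agreement and invalid-rate metrics across evaluation cases."""
    total_actions = sum(c.total_actions for c in cases) or 1
    return {
        "agreement_rate": sum(c.agreed for c in cases) / len(cases),
        "invalid_action_rate": sum(c.invalid_actions for c in cases) / total_actions,
        "invalid_bounded_tool_rate": sum(c.invalid_bounded_tool for c in cases) / total_actions,
    }


cases = [
    CaseResult(agreed=True, invalid_actions=0, invalid_bounded_tool=0, total_actions=4),
    CaseResult(agreed=False, invalid_actions=1, invalid_bounded_tool=1, total_actions=4),
]
print(aggregate(cases))
# -> {'agreement_rate': 0.5, 'invalid_action_rate': 0.125, 'invalid_bounded_tool_rate': 0.125}
```

Rates of this shape are what the before-vs-after comparison plots referenced in the row would consume: one value per run, computed over the shared set of evaluation cases.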

### Kush (Person D) - Completed on behalf of others

| ID | Epic | Assigned To | Task | File/Module | Date | What Was Done | Acceptance Criteria | Verified |
|---|---|---|---|---|---|---|---|---|
| FND 03 | E01 | Max (Person C) | Initialize React plus Vite frontend shell | frontend/package.json, frontend/src/, frontend/public/ | 2026-03-08 | Imported the full React plus Vite frontend tree from Kush's branch onto ayush, including the app shell, pages, component library, assets, and TypeScript config. | npm install and dev server run successfully | Yes - verified with npm --prefix frontend install and npm --prefix frontend run build |
| FND 12 | E01 | Max (Person C) | Create Vite config with API and WebSocket proxy support plus stable build output settings | frontend/vite.config.ts | 2026-03-08 | Imported Kush's Vite configuration with @ alias plus /api and /ws proxy rules, then verified the frontend builds successfully against that config on ayush. | Frontend dev server can reach backend without manual URL edits and build output is predictable for Docker packaging | Yes - verified with npm --prefix frontend run build |
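The FND 12 row describes a Vite config with an `@` alias plus `/api` and `/ws` proxy rules. The fragment below is an illustrative sketch of that shape only; the real `frontend/vite.config.ts` may differ in targets, ports, and plugin set, and the backend port 8000 is an assumption.

```typescript
// Illustrative sketch of frontend/vite.config.ts; targets and ports are
// assumptions, not the project's actual values.
import { defineConfig } from "vite";
import { fileURLToPath, URL } from "node:url";

export default defineConfig({
  resolve: {
    // "@" alias so imports can be written as "@/components/..."
    alias: { "@": fileURLToPath(new URL("./src", import.meta.url)) },
  },
  server: {
    proxy: {
      // REST requests forwarded to the backend, so no manual URL edits
      "/api": { target: "http://localhost:8000", changeOrigin: true },
      // WebSocket traffic (ws: true enables WebSocket proxying)
      "/ws": { target: "ws://localhost:8000", ws: true },
    },
  },
  build: {
    // stable output directory, which keeps Docker packaging predictable
    outDir: "dist",
  },
});
```

With a config of this shape, the dev server and the Docker build both resolve backend URLs the same way, which is what the acceptance criterion in the row checks.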

### Shared Tasks - Completed

| ID | Epic | Owners | Task | Status |
|---|---|---|---|---|
| FND 08 | E01 | Person A and B | Freeze JSON contract for actions and observations | Completed |

### Max (Person C) - Completed own task

| ID | Epic | Task | Status |
|---|---|---|---|
| FND 11 | E01 | Create server/requirements.txt pinning runtime dependencies | Completed |

### Kush (Person D) - Completed own tasks

| ID | Epic | Task | Status |
|---|---|---|---|
| FND 13 | E01 | Tailwind v4.2 + theme tokens + light/dark mode | Completed |
| UI 01 | E09 | App shell with three-panel layout | Completed |
| UI 02 | E09 | PaperPanel | Completed |
| UI 03 | E09 | ProtocolPanel with DiffRow | Completed |
| UI 04 | E09 | NegotiationLog with character avatars | Completed |
| UI 05 | E09 | ScorePanel with rigor/feasibility/fidelity bars | Completed |
| UI 06 | E09 | Controls (scenario selector, seed input, difficulty) | Completed |
| UI 07 | E09 | REST + WebSocket API client (api.ts) | Completed |
| UI 08 | E09 | ReplayViewer with range slider | Completed |
| UI 09 | E09 | TrainingResults with LineChart | Completed |
| UI 10 | E09 | Styling, animations, 3D lab scene | Completed |
| UI 11 | E09 | Multi-stage Docker, SPA serving | Completed |
| UI 13 | E09 | JudgeAuditPanel with verdict display | Completed |
| UI 14 | E09 | Replay scrubber with skip buttons | Completed |
| UI 15 | E09 | Before vs after training toggle | Completed |
| JDG 09 | E05 | Mock score cards for frontend | Completed |
| OBS 05 | E10 | Episode ID + copy-to-clipboard in UI | Completed |

## What Completing These Tasks Unblocked

| Completed Task | Directly Unblocked |
|---|---|
| FND 01 | FND 02, FND 03, FND 04, FND 05, FND 06, FND 07, FND 10 |
| FND 02 | FND 11 |
| FND 03 | FND 12, FND 13, UI 01 |
| FND 04 | FND 08, FND 09 |
| FND 05 | No downstream dependencies |
| FND 06 | DOC 01 |
| FND 07 | No downstream dependencies |
| FND 08 | MOD 01, MOD 02, MOD 03, MOD 12, SCN 01 |
| FND 09 | OpenEnv registration layer is now present for later /web and deployment work |
| FND 10 | No downstream dependencies |
| FND 11 | No new formal dependencies, but server scaffold work can now install from a standalone requirements file |
| FND 12 | Frontend dev proxying is now configured for local API and WebSocket work |
| MOD 01 | MOD 05, MOD 09 |
| MOD 02 | No new formal dependencies, but the Lab Manager contract is now stable for later policy work |
| MOD 03 | MOD 04, MOD 11 |
| MOD 04 | MOD 07, ENV 01 |
| MOD 05 | MOD 06, AGT 05 |
| MOD 11 | No new formal dependency edge by itself, but StepResult metadata is now stable for environment, API, replay, and training consumers |
| MOD 12 | Shared defaults now come from replicalab/config.py, reducing config drift before environment and scoring work expands |
| SCN 01 | SCN 09 now has a deterministic seed utility to build on |
| SCN 02 | SCN 03, SCN 04, SCN 05, SCN 07 |
| SCN 03 | SCN 06, SCN 08 |
| SCN 04 | SCN 06, SCN 08 |
| SCN 05 | SCN 06, SCN 08 |
| SCN 06 | Harder scenario variants and curriculum-ready difficulty scaling now exist |
| SCN 07 | AGT 05 is complete; AGT 06, AGT 07, and JDG 02 are now unblocked from the normalized resource layer |
| SCN 08 | AGT 06 is now unblocked; JDG 01 and JDG 03 are also unblocked |
| SCN 09 | SCN 10, SCN 11, ENV 01, ENV 02 |
| SCN 10 | Scenario determinism and consistency now have regression coverage |
| SCN 11 | AGT 01, TRN 08 |
| MOD 09 | Together with completed AGT 02, AGT 03 is now unblocked |
| AGT 01 | AGT 02, AGT 11, TRN 04 |
| AGT 02 | AGT 03, AGT 04 |
| AGT 04 | Removes the last baseline-policy blocker; AGT 08 now only waits on AGT 03 |
| AGT 05 | AGT 06, AGT 07, JDG 02 |
| AGT 06 | No new formal dependency edge by itself, but AGT 07 now has deterministic revision content to narrate and compare against |
| AGT 07 | AGT 10 is now unblocked, and the stub server now emits grounded Lab Manager responses instead of placeholder review text |
| AGT 10 | Prompt templates now exist for all three roles with bounded tool rules and normalized scenario rendering, reducing prompt drift between notebooks, demos, and future model calls |
| AGT 11 | No new formal dependency edge by itself, but the Scientist training model choice is now fixed across repo docs |
| ENV 01 | ENV 02, ENV 08, and the real-environment import path that partial server tasks now depend on |
| JDG 01 | Together with JDG 02 and JDG 03, unblocks JDG 04 (total reward formula) |
| JDG 02 | Together with JDG 01 and JDG 03, unblocks JDG 04 (total reward formula) |
| JDG 03 | Together with JDG 01 and JDG 02, unblocks JDG 04 (total reward formula) |
| JDG 04 | JDG 05, JDG 08, TST 04, TST 05 |
| JDG 05 | JDG 06, JDG 07, JDG 09, JDG 10, JDG 11, ENV 06 |
| JDG 06 | AGT 10, JDG 11 |
| ENV 02 | ENV 03, ENV 07, ENV 10, TST 01, API 02 (partial → full) |
| ENV 03 | ENV 04, ENV 05, TST 02, TST 03 |
| ENV 04 | ENV 05, TST 02 |
| ENV 05 | ENV 06, TST 02 |
| ENV 06 | ENV 07, ENV 09, ENV 11, API 03 (partial → full), API 06 (partial → full), OBS 07 |
| API 06 | TRN 03, TRN 13 |
| API 09 | API 10, API 17 |
| TST 07 | No new dependencies |
| ENV 07 | ENV 10 (partial unblock) |
| ENV 08 | API 07 (partial → full) |
| TST 01 | No new dependencies |
| TST 02 | No new dependencies |
| TST 03 | No new dependencies |
| API 02 | API 14, UI 06 |
| TRN 13 | TRN 03 now has both its dependencies met (API 06 + TRN 13) |
| TRN 03 | TRN 01 (Colab notebook skeleton), TRN 04 (reward shaping for GRPO) |
| TRN 04 | TRN 05 (trainer integration) and a partial unblock for TRN 06 (metrics logging once JDG 10 exists) |
| API 08 | API 09, API 16, API 19 |
| MOD 06 | Partial unblock for MOD 08 (unit tests for schemas and validators, which depends on MOD 01–07) |
| MOD 07 | MOD 08, JDG 07 |
| MOD 10 | Frontend and notebook consumers now share canonical schema examples generated from the current contracts |
| SCN 13 | No new formal dependency edge by itself, but deterministic booking and scheduling conflicts are now present in the normalized scenario pack for later environment, judge, and UI work |
| AGT 09 | No new formal dependency edge by itself, but the grounded Lab Manager checker/suggestion/response stack now has deterministic regression coverage |
| JDG 11 | ENV 11 (attach audit to terminal StepResult), UI 13 (render audit in frontend), OBS 09 (extend episode summary with audit) |
| ENV 11 | No new fully unblocked tasks by itself; API 18 and OBS 09 are each one dependency closer because the audit payload now survives into replay-facing state |
| API 10 | TRN 01 (Colab notebook skeleton), TRN 11 (environment URL documentation) |
| API 17 | No new formal dependency edge by itself, but the secrets landscape is now documented for all contexts |
| ENV 09 | OBS 01, API 05 |
| OBS 01 | OBS 03, OBS 07 |
| OBS 03 | No downstream dependencies beyond OBS 07, which is also complete |
| OBS 07 | No downstream dependencies |
| OBS 09 | TRN 15 is one dependency closer (it still needs TRN 06 and TRN 08) |
| API 05 | UI 08, OBS 05 |
| API 11 | No downstream dependencies |
| API 18 | TST 11, UI 13 |
| TST 06 | No downstream dependencies |
| TST 11 | No downstream dependencies |

## Current Unblocked and Active Tasks

All 152 tasks are complete. No tasks remain.


## Epic Progress

| Epic | Total Tasks | Completed | Rate |
|---|---|---|---|
| E01. Foundations and repository setup | 13 | 13 | 100.00% |
| E02. Domain models, validation, state contracts | 12 | 12 | 100.00% |
| E03. Scenario engine and constraint generation | 13 | 13 | 100.00% |
| E04. Scientist agent and Lab Manager policy | 11 | 11 | 100.00% |
| E05. Judge engine and reward logic | 11 | 11 | 100.00% |
| E06. OpenEnv environment implementation | 11 | 11 | 100.00% |
| E07. API, server, Docker, deployment | 19 | 19 | 100.00% |
| E08. RL training pipeline and evaluation | 15 | 15 | 100.00% |
| E09. Frontend, UX, replay, demo views | 15 | 15 | 100.00% |
| E10. Logging, replay, and observability | 9 | 9 | 100.00% |
| E11. Testing and quality gates | 12 | 12 | 100.00% |
| E12. README, demo video, submission packaging | 11 | 11 | 100.00% |