
# ReplicaLab Task Completion Tracker

Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`


## Working Governance Files

| File | Role |
| --- | --- |
| `AGENTS.md` | Required startup and close-out rules for contributors and automated model agents |
| `docs/project_management_rules.md` | Detailed project-management workflow |
| `docs/changes.md` | Append-only deviation log |
| `docs/<owner>/` | Owner-local task and planning docs |

## Overall Completion

| Metric | Value |
| --- | --- |
| Total tasks | 152 |
| Completed | 152 |
| Partial / active | 0 |
| Remaining | 0 |
| Completion rate | 100.00% |

Post-MVP benchmark note:

- On 2026-03-09, a live Northflank H100 first-step benchmark was added as an operational post-MVP artifact under `replicalab/outputs/training/h100-one-step-500-20260309/`.
- It covers 500 total simulations (250 shared reset cases × baseline and trained first-step actions) and records paper-understanding regression data for the current saved Scientist adapter.

## Completion by Person

| Person | Assigned | Completed (own) | Completed (by others) | Remaining | Rate |
| --- | --- | --- | --- | --- | --- |
| Kian (Person A) | 49 (47 solo + 2 shared with B) | 1 shared sign-off (FND 08) | 48 (FND 04, FND 09, MOD 01, MOD 02, MOD 03, MOD 04, MOD 05, MOD 06, MOD 08, MOD 11, MOD 12, SCN 01 to SCN 10, SCN 13, AGT 05, AGT 09, ENV 01 to ENV 08, ENV 10, ENV 11, JDG 01 to JDG 06, JDG 08, JDG 11, OBS 04, TST 01 to TST 05; done by Person B) | 0 | 100.00% |
| Person B (Ayush) | 29 (27 solo + 2 shared with A) | 29 (FND 08, MOD 09, SCN 11, AGT 01, AGT 02, AGT 03, AGT 04, AGT 05, AGT 06, AGT 07, AGT 08, AGT 10, AGT 11, JDG 10, TRN 01 to TRN 10, TRN 13, TRN 14, TRN 15, OBS 06, TST 09) | 0 | 0 | 100.00% |
| Max (Person C) | 41 | 1 (FND 11) | 40 (done by Person B or Person D; API 16, UI 11 by Kush) | 0 | 100.00% |
| Kush (Person D) | 32 | 17 (FND 13, UI 01-UI 06, UI 07-UI 09, UI 10, UI 11, UI 13-UI 15, JDG 09, OBS 05) | 15 (by Person B: FND 06, SCN 12, API 12, TRN 12, UI 12, OBS 08, TST 08, TST 12, DOC 01-DOC 07, DOC 09, DOC 11) | 0 | 100.00% |
| All (shared) | 3 | 3 (FND 08, AGT 05, TST 10) | 0 | 0 | 100.00% |

All 152 tasks are now complete (100%). Every person's lane is closed:

- Kian (Person A): 49/49 (48 done by Person B)
- Ayush (Person B): 29/29
- Max (Person C): 41/41 (done by Person B and Kush)
- Kush (Person D): 32/32 (17 by Kush, 15 by Person B)
- Shared: 3/3

## Active Partial Tasks

| ID | Assigned To | Current Status | Remaining Acceptance Item |
| --- | --- | --- | --- |
| — | — | No active partial tasks | — |

## Completed Tasks

### Person B (Ayush) - Completed on behalf of others

Each entry below lists: ID, Epic, Assigned To, Task, File/Module, Date, What Was Done, Acceptance Criteria, Verified.
FND 01 E01 Person C Create repo structure and base folders from agreed layout repo root 2026-03-07 Created the full repo scaffold: replicalab/ with subdirectories for agents/, env/, prompts/, scenarios/, scoring/, utils/; server/; frontend/ with src/components/ and src/pages/; notebooks/; tests/. All directories tracked via .gitkeep files. All top level folders exist and repo clones cleanly Yes
FND 02 E01 Person C Add Python project config and dependencies placeholder pyproject.toml 2026-03-08 Added a PEP 621 pyproject.toml with package metadata, Python 3.10+ requirement, runtime dependencies (pydantic, fastapi, uvicorn, websockets), dev extras (pytest, pytest-cov, ruff, mypy), package discovery, and pytest test-path settings. Project installs locally without missing package errors for base modules Yes - verified with python -m pip install -e ., python -m pip install -e ".[dev]", and python -c "from replicalab.models import ..."
FND 04 E01 Person A Add empty Pydantic models and shared type names replicalab/models.py 2026-03-08 Created replicalab/__init__.py and replicalab/models.py with the shared action, observation, step, state, and log stubs. Import paths resolve for all placeholder models Yes - verified with python -c "from replicalab.models import ..."
FND 05 E01 Person C Add ignore rules for Python, Node, logs, notebooks, and build artifacts .gitignore, .dockerignore 2026-03-08 Added .dockerignore and expanded .gitignore for caches, coverage artifacts, notebook checkpoints, frontend build files, and generated outputs while preserving tracked .gitkeep files. Repo status stays clean after local run and build, and Docker build excludes non-runtime files Yes
FND 06 E01 Person D Add temporary project stub with title, mission, team roles, and local setup placeholder README.md 2026-03-08 Replaced the aspirational README with a temporary foundation stub that reflects the current repo state, mission, ownership, and verified setup placeholder. New contributor can understand repo purpose in under two minutes Yes
FND 07 E01 Person C Define branch naming, PR template, and issue template .github/ and repo workflow docs 2026-03-08 Added .github/pull_request_template.md and .github/ISSUE_TEMPLATE/task.yml, and documented preferred branch naming patterns plus required tracking-doc updates in docs/project_management_rules.md. All future PRs auto show the template and issue fields Yes
FND 09 E01 Person A Create OpenEnv configuration file specifying environment class, action and observation types, and server settings openenv.yaml, pyproject.toml, server/app.py, uv.lock 2026-03-08 Added openenv.yaml, recorded the environment and contract metadata for OpenEnv, added openenv-core plus a server script entry point to pyproject.toml, added main() to server/app.py, and generated uv.lock so the repo passes local OpenEnv validation. OpenEnv can discover and serve the environment using this config file Yes - verified with uv lock and openenv validate
FND 10 E01 Person C Create output directory structure replicalab/outputs/ 2026-03-07 Created replicalab/outputs/ with three subdirectories: logs/, replays/, and plots/, all tracked via .gitkeep files. Output directories exist and generated files are not committed to git Yes
MOD 01 E02 Person A Implement ScientistAction schema replicalab/models.py, tests/test_models.py, server/app.py 2026-03-08 Replaced the ScientistAction stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, rejected mixed-mode payloads, added conditional validation for proposal, revision, request-info, and accept modes, and patched the stub server so accept preserves the current protocol. Valid scientist actions parse and invalid fields raise validation errors Yes - verified with python -m pytest tests/test_models.py and a stub-env ScientistAction.model_validate(...) smoke step
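The mixed-mode rejection described in this entry can be sketched with a plain-Python check. This is a minimal illustration of the rule, not the project's actual Pydantic code; the action-type values and the payload keys (`protocol`, `rationale`, `question`) are assumptions:

```python
# Sketch: mode-conditional action validation with unknown-key rejection.
from enum import Enum


class ActionType(str, Enum):
    PROPOSE = "propose"
    REVISE = "revise"
    REQUEST_INFO = "request_info"
    ACCEPT = "accept"


# Which payload keys each mode may carry (illustrative field names).
ALLOWED_FIELDS = {
    ActionType.PROPOSE: {"protocol", "rationale"},
    ActionType.REVISE: {"protocol", "rationale"},
    ActionType.REQUEST_INFO: {"question"},
    ActionType.ACCEPT: set(),
}


def validate_action(action_type: ActionType, payload: dict) -> None:
    """Reject unknown keys and fields that belong to a different mode."""
    extra = set(payload) - ALLOWED_FIELDS[action_type]
    if extra:
        raise ValueError(
            f"fields not allowed for {action_type.value}: {sorted(extra)}"
        )
```

In the real schema the same effect comes from strict model configuration plus conditional validators; the point here is only that each mode whitelists its own fields and everything else fails fast.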
MOD 02 E02 Person A Implement LabManagerAction schema replicalab/models.py, tests/test_models.py 2026-03-08 Replaced the LabManagerAction stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, enforced feasible-flag consistency, rejected suggestion fields outside suggest_alternative, and added focused validation tests. Valid lab manager actions parse and invalid fields raise validation errors Yes - verified with python -m pytest tests/test_models.py
MOD 03 E02 Person A Implement role specific observation models replicalab/models.py, tests/test_models.py, server/app.py 2026-03-08 Added typed ConversationEntry and Protocol models, upgraded both observation branches to use typed nested structures with non-negative numeric constraints and stable keys, and verified dict-to-model coercion through the stub server. Scientist and lab observations serialize to JSON with stable keys Yes - verified with python -m pytest tests/test_models.py and a stub reset() / step() JSON smoke test
MOD 04 E02 Person A Implement EpisodeState and EpisodeLog models replicalab/models.py, server/app.py, tests/test_models.py 2026-03-08 Replaced the remaining loose dict state and replay fields with typed Protocol, ConversationEntry, and RewardBreakdown models, updated the stub runtime to construct those nested models explicitly, and added round-trip coverage for serialized state and logs. Full state round trip serialize plus deserialize works Yes - verified with python -m pytest tests/test_models.py
MOD 05 E02 Person A Add protocol validation for sample size, controls, duration, equipment vocab, and reagent vocab replicalab/utils/validation.py, tests/test_models.py, tests/test_scenarios.py 2026-03-08 Added deterministic semantic protocol validation with ValidationResult and validate_protocol(...) checks for resource vocabulary, allowed substitutions, duration limits, required-element coverage, and obvious impossibilities against the normalized scenario pack. Invalid protocol examples are rejected with readable reasons Yes - verified with python -m pytest tests/test_models.py tests/test_scenarios.py
MOD 06 E02 Person A Add semantic validators for impossible plans such as zero sample size with positive controls replicalab/utils/validation.py, tests/test_validation.py 2026-03-08 Added _check_semantic_impossibilities() with five checks: zero sample with controls (error), controls >= sample size (error), duplicate controls (warning), duplicate equipment (warning), duplicate reagents (warning). Seven new tests cover all cases plus a regression guard confirming valid protocols still pass. Semantic validator catches at least five invalid edge cases Yes - verified with python -m pytest tests/test_validation.py (20 tests pass) and full suite (223 passed)
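The five checks listed in this entry can be illustrated with a small stand-alone function. This is a sketch under an assumed protocol shape (`sample_size`, `controls`, `equipment`, `reagents` as dict keys), not the actual `_check_semantic_impossibilities()` implementation:

```python
# Sketch of the five semantic-impossibility checks described above.
def check_semantic_impossibilities(protocol: dict) -> list[tuple[str, str]]:
    """Return (severity, message) findings; 'error' blocks, 'warning' does not."""
    findings = []
    sample_size = protocol.get("sample_size", 0)
    controls = protocol.get("controls", [])
    # 1. Zero sample size with positive controls is contradictory.
    if sample_size == 0 and controls:
        findings.append(("error", "zero sample size with positive controls"))
    # 2. Controls meeting or exceeding the sample size leave no treatment arm.
    if controls and sample_size > 0 and len(controls) >= sample_size:
        findings.append(("error", "controls meet or exceed sample size"))
    # 3-5. Duplicate controls / equipment / reagents are suspicious but survivable.
    for key in ("controls", "equipment", "reagents"):
        items = protocol.get(key, [])
        if len(items) != len(set(items)):
            findings.append(("warning", f"duplicate {key}"))
    return findings
```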
MOD 07 E02 Person C Add state serialization helper for replay logs replicalab/utils/logging.py, tests/test_logging.py 2026-03-08 Added file-based replay persistence helpers with atomic JSON writes (write_episode_log, load_episode_log) plus CSV reward logging (append_reward_csv). Eleven tests cover lossless round-trip, filename behavior, nested directory creation, transcript and reward-breakdown preservation, CSV headers, append semantics, missing-file errors, and default output targets. State logs can be written and loaded without loss Yes - verified with python -m pytest tests/test_logging.py (11 tests pass)
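An atomic JSON write of the kind this entry describes is usually done by writing to a temporary file in the same directory and renaming it into place. A minimal sketch of the pattern (the real helpers live in `replicalab/utils/logging.py`; this is not their actual code):

```python
import json
import os
import tempfile


def write_json_atomic(path: str, payload: dict) -> None:
    """Write JSON to a temp file in the target directory, then rename into place.

    os.replace() is atomic on POSIX, so a reader never observes a half-written
    replay log even if the process dies mid-write.
    """
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_path, path)  # atomic swap into the final filename
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on any failure
        raise
```

The temp file must live on the same filesystem as the target, which is why it is created in the destination directory rather than the system temp dir.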
MOD 10 E02 Person C Publish schema examples for frontend and notebook clients tests/fixtures/generate_api_examples.py, tests/fixtures/api_schema_examples.json 2026-03-08 Added a deterministic generator that builds canonical REST and WebSocket example payloads from real Pydantic models and seeded scenario data, then writes a shared api_schema_examples.json fixture for frontend and notebook consumers. The generated examples now use the current deterministic judge metadata instead of stale stub text. Frontend and notebook can mock against shared sample payloads Yes - verified with python tests/fixtures/generate_api_examples.py and fixture review
MOD 11 E02 Person A Implement StepResult model replicalab/models.py, server/app.py, tests/test_models.py 2026-03-08 Added typed RewardBreakdown and StepInfo models, upgraded StepResult.info to the reserved-key contract while still allowing debug metadata, and updated the stub runtime to build typed reward and step-info payloads explicitly. Step result serializes cleanly and all consumers agree on its shape Yes - verified with python -m pytest tests/test_models.py
MOD 12 E02 Person A Create environment configuration module with shared constants replicalab/config.py, server/app.py, replicalab/scenarios/*.py, tests/test_config.py 2026-03-08 Added a shared configuration module for default scenario and difficulty, round cap, budget cap, timeout values, stub reward, and API host or port defaults, then updated the server and scenario builders to import those constants instead of repeating literals. All modules import config from one place and no magic numbers remain in env or scoring code Yes - verified with python -m pytest tests/test_config.py tests/test_scenarios.py
SCN 01 E03 Person A Implement deterministic RNG helper seed_rng() replicalab/utils/seed.py, replicalab/scenarios/templates.py 2026-03-08 Added deterministic seed helpers that derive reproducible RNG namespaces for scenario generation. Same seed always yields the same random choices and the seed utility is importable from scenarios and env Yes - verified with python -m pytest tests/test_scenarios.py
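One common way to derive reproducible, independent RNG streams per namespace from a single seed is to hash the (seed, namespace) pair. A sketch of the idea only; `seed_rng()`'s actual signature and derivation may differ:

```python
import hashlib
import random


def seed_rng(seed: int, namespace: str) -> random.Random:
    """Derive an independent, reproducible RNG stream for (seed, namespace).

    Hashing keeps streams decorrelated: "scenario" and "difficulty" draws
    from the same base seed do not interleave or collide.
    """
    digest = hashlib.sha256(f"{seed}:{namespace}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))
```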
SCN 02 E03 Person A Define normalized scenario schema with task summary, success criteria, constraints, resources, allowed substitutions, and hidden reference spec replicalab/scenarios/templates.py 2026-03-08 Added NormalizedScenarioPack plus strict ScenarioConstraint, ScenarioResource, AllowedSubstitution, and HiddenReferenceSpec models to standardize all scenario families. All scenario builders return the same normalized top-level structure and mapper-ready inputs Yes - verified with python -m pytest tests/test_scenarios.py
SCN 03 E03 Person A Implement mathematics template replicalab/scenarios/math_reasoning.py 2026-03-08 Added deterministic mathematics planning templates covering theorem, proof-goal, review, and time constraints. Generated scenario passes structure and internal consistency tests Yes - verified with python -m pytest tests/test_scenarios.py
SCN 04 E03 Person A Implement ML benchmark template replicalab/scenarios/ml_benchmark.py 2026-03-08 Added deterministic ML benchmark templates covering dataset, compute, time, and evaluation constraints. Generated scenario passes structure and internal consistency tests Yes - verified with python -m pytest tests/test_scenarios.py
SCN 05 E03 Person A Implement finance and trading planning template replicalab/scenarios/finance_trading.py 2026-03-08 Added deterministic offline finance and trading planning templates covering capital, drawdown, slippage, and backtest rules. Generated scenario passes structure and internal consistency tests Yes - verified with python -m pytest tests/test_scenarios.py
SCN 06 E03 Person A Implement difficulty application for easy, medium, hard replicalab/scenarios/templates.py, tests/test_scenarios.py 2026-03-08 Added mechanical difficulty scaling that adjusts budgets, time, staff, resource availability, and injected conflict constraints across easy, medium, and hard. Difficulty visibly changes the normalized scenario pack in a meaningful way Yes - verified with python -m pytest tests/test_scenarios.py
SCN 07 E03 Person A Implement normalized constraint and resource generator replicalab/scenarios/templates.py, tests/test_scenarios.py 2026-03-08 Added normalized constraint and resource mapping into role-specific observations with consistency checks for unique keys and non-contradictory generated packs. No generated scenario contains contradictory constraints or resources Yes - verified with python -m pytest tests/test_scenarios.py
SCN 08 E03 Person A Implement hidden reference spec and allowed substitutions per template replicalab/scenarios/templates.py, tests/test_scenarios.py 2026-03-08 Added per-template hidden reference specs and allowed substitutions so scoring and negotiation can distinguish fixed versus flexible elements deterministically. Hidden reference clearly marks what is fixed versus flexible for deterministic scoring Yes - verified with python -m pytest tests/test_scenarios.py
SCN 09 E03 Person A Implement generate_scenario(seed, template, difficulty) replicalab/scenarios/templates.py, server/app.py, tests/test_scenarios.py 2026-03-08 Added deterministic full-scenario generation and wired the stub server to use the normalized scenario families instead of the earlier hard-coded lab-only placeholder list. Function returns a full scenario with deterministic content Yes - verified with python -m pytest tests/test_scenarios.py and a _StubEnv.reset(...) smoke test
SCN 10 E03 Person A Add seeded generation tests and consistency tests tests/test_scenarios.py 2026-03-08 Added seeded determinism, variation, difficulty, consistency, and family-list tests for the normalized scenario engine. Same seed plus template returns the same scenario and different seeds vary Yes - verified with python -m pytest tests/test_scenarios.py
SCN 13 E03 Person A Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time slot conflicts and duration replicalab/scenarios/templates.py, replicalab/scenarios/__init__.py, tests/test_scenarios.py 2026-03-08 Added typed ResourceBooking and SchedulingWindow models, extended NormalizedScenarioPack with deterministic booking and scheduling data, wired seeded generation into the scenario builder across all three domains, and added five scenario tests covering determinism, easy-mode no-conflict behavior, JSON round-trip, valid windows, and domain coverage. Constraint generator can produce realistic booking conflicts across domains and the Lab Manager can check availability Yes - verified with python -m pytest tests/test_scenarios.py (13 tests pass) and full suite (304 passed)
AGT 09 E04 Person A Add deterministic feasibility checker tests for Lab Manager grounding tests/test_lab_manager_policy.py 2026-03-08 Added seventeen deterministic regression tests covering check_feasibility(...), suggest_alternative(...), and compose_lab_manager_response(...) across all three domains, including repeated-run determinism, substitution ordering, duration and sample-size revision stability, never-worsens checks, action-type branching, flag mirroring, and explanation stability. Same proposal plus same normalized scenario returns the same checker results every time Yes - verified with python -m pytest tests/test_lab_manager_policy.py (37 tests pass) and full suite (304 passed)
ENV 01 E06 Person A Create ReplicaLabEnv class skeleton replicalab/env/replicalab_env.py, replicalab/env/__init__.py 2026-03-08 Added a real ReplicaLabEnv module as a drop-in replacement for the former in-server stub, ported the working stub behavior into the environment package, wired scenario-pack-backed reset/step/state/close methods with follow-on TODO(ENV XX) markers, and removed the old stub-only marker from StepInfo payloads. Environment class imports and instantiates without runtime errors Yes - verified with a direct ReplicaLabEnv.reset(...) -> step(...) -> state() -> close() smoke run and python -m pytest (111 passed)
JDG 01 E05 Person A Implement rigor or objective-validity score replicalab/scoring/rigor.py, replicalab/utils/text.py, tests/test_reward.py 2026-03-08 Added score_rigor(protocol, scenario) with weighted sub-scores for structural completeness (0.30), success criteria coverage (0.40), and required element coverage (0.30). Uses shared element_tokens from replicalab/utils/text.py. Five focused tests in test_reward.py cover quality ordering, determinism, controls impact, rationale length, and all-domain range validation. Score is between 0 and 1, matches rubric examples, and rewards correct evidence-backed planning Yes - verified with python -m pytest tests/test_reward.py (18 tests pass)
JDG 02 E05 Person A Implement feasibility score replicalab/scoring/feasibility.py, tests/test_reward.py 2026-03-08 Added score_feasibility(protocol, scenario, check=None) that derives a continuous [0,1] signal from FeasibilityCheckResult (AGT 05). Seven dimensions weighted equally (1/7) with partial credit for budget, equipment, reagents, and staff. Accepts optional pre-computed check to avoid redundant work. Six focused tests cover viable protocol, infeasible ordering, pre-computed check equivalence, determinism, partial credit, and all-domain range. Score is between 0 and 1 and matches normalized constraint logic Yes - verified with python -m pytest tests/test_reward.py (18 tests pass)
JDG 03 E05 Person A Implement fidelity score replicalab/scoring/fidelity.py, tests/test_reward.py 2026-03-08 Added score_fidelity(protocol, scenario) with substitution-aware scoring: required element coverage (0.50, direct match=1.0, substitution=0.7), flexible element alignment (0.20, bonus only), target metric alignment (0.20), and technique appropriateness (0.10). Five focused tests cover aligned vs misaligned ordering, determinism, substitution partial credit, target metric impact, and all-domain range. Score is between 0 and 1 and matches rubric examples for plan and evidence alignment Yes - verified with python -m pytest tests/test_reward.py (18 tests pass)
JDG 04 E05 Person A Implement total reward formula replicalab/scoring/rubric.py, tests/test_reward.py 2026-03-07 compute_total_reward(breakdown) implements 10 Γ— rigor Γ— feasibility Γ— fidelity + bonuses βˆ’ penalties with max(0.0, ...) floor clamp. Eight new tests in test_reward.py verify perfect-vs-broken ordering, zero-feasibility behavior, efficiency bonus ordering, exact penalty subtraction, zero-clamp floor, determinism, external penalties injection, and default-empty penalties. Seven existing rubric tests in test_env.py also cover the formula. Total reward formula matches agreed math, clamps at zero, and returns consistent output for plan quality and bounded tool behavior Yes - verified with python -m pytest tests/test_reward.py (26 tests pass) and python -m pytest tests/test_env.py (36 tests pass)
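The agreed formula above reads as 10 × rigor × feasibility × fidelity, plus bonuses, minus penalties, floored at zero. A direct transcription (the project's `compute_total_reward` takes a breakdown object; scalar arguments are used here for clarity):

```python
def compute_total_reward(rigor: float, feasibility: float, fidelity: float,
                         bonuses: float = 0.0, penalties: float = 0.0) -> float:
    """10 * rigor * feasibility * fidelity + bonuses - penalties, clamped at 0."""
    return max(0.0, 10.0 * rigor * feasibility * fidelity + bonuses - penalties)
```

Because the three axis scores multiply, any single zero axis (for example zero feasibility) collapses the base term to zero, which matches the zero-feasibility behavior the tests verify.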
JDG 05 E05 Person A Build reward breakdown object replicalab/scoring/rubric.py, replicalab/scoring/__init__.py, tests/test_reward.py 2026-03-07 build_reward_breakdown(...) accepts an optional penalties: dict[str, float] parameter for named penalty keys (e.g. invalid_tool_use, unsupported_claim) from bounded-tool diagnostics without reopening the model contract. Returns a typed RewardBreakdown with rigor, feasibility, fidelity, efficiency_bonus, communication_bonus, and penalties dict. Exported through replicalab.scoring. Breakdown includes rigor, feasibility, fidelity, bonuses, penalties, and bounded tool diagnostics extension point Yes - verified with python -m pytest tests/test_reward.py (26 tests pass) and python -m pytest tests/test_env.py (36 tests pass)
JDG 06 E05 Person A Add optional plain English explanation function from reward breakdown replicalab/scoring/explain.py, replicalab/scoring/__init__.py, tests/test_reward.py 2026-03-08 Added explain_reward(...), a deterministic explanation builder that mirrors rigor, feasibility, fidelity, bonuses, penalties, and total reward with stable quality-tier labels and without introducing any new scoring logic. Exported through replicalab.scoring and covered by nine focused tests. Explanation mirrors rubric, may reference bounded evidence or tool outcomes, and introduces no new hidden logic Yes - verified with python -m pytest tests/test_reward.py (40 tests pass)
JDG 08 E05 Person A Add score determinism tests and edge case tests tests/test_reward.py 2026-03-08 Added six focused regression tests covering good-vs-awful ordering across all judge axes and total reward, success-criteria sensitivity in rigor scoring, partial equipment credit ordering in feasibility scoring, direct-match vs allowed-substitution vs miss ordering in fidelity scoring, and reward-breakdown determinism with and without a precomputed feasibility check. Perfect and broken protocols produce expected relative ordering and scoring remains deterministic across edge cases Yes - verified with python -m pytest tests/test_reward.py (40 tests pass) and python -m pytest -q (264 passed)
JDG 11 E05 Person A Add structured final audit payload with judge_notes, verdict, and top failure reasons replicalab/agents/judge_policy.py, replicalab/agents/__init__.py, tests/test_judge_policy.py 2026-03-08 Created JudgeAudit model and build_judge_audit() builder that derives verdict (accept/timeout/no_agreement), reuses explain_reward() for judge_notes, and extracts top failure reasons from weak rubric components and penalty keys. Exported through replicalab.agents. Ten tests cover all three verdict paths, component-driven failure reasons, penalty surfacing, reason cap, good-protocol empty reasons, determinism, and JSON round-trip. Final judgement output is deterministic, human readable, and consumable by env, API, logs, and UI Yes - verified with python -m pytest tests/test_judge_policy.py (10 tests pass) and full suite (255 passed)
ENV 02 E06 Person A Implement real reset wiring replicalab/env/replicalab_env.py 2026-03-08 _make_observation() now uses the scenario pack as source of truth for booked/out-of-stock/safety data instead of empty placeholders. Eight reset tests verify both roles populated, booked/out-of-stock preserved, all templates and difficulties. Reset returns initial observations with full scenario data Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
ENV 03 E06 Person A Implement Scientist turn with validation replicalab/env/replicalab_env.py 2026-03-08 Added _validate_scientist_action() that runs validate_protocol() on proposals and returns structured error strings without crashing the env. Invalid actions don't advance the round. Valid action updates state, invalid action returns structured error Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
ENV 04 E06 Person A Implement Lab Manager response step replicalab/env/replicalab_env.py 2026-03-08 _lab_manager_action() uses the full grounded pipeline: check_feasibility() → suggest_alternative() → compose_lab_manager_response(). Lab Manager response is grounded in feasibility check results Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
ENV 05 E06 Person A Centralize termination logic replicalab/env/replicalab_env.py 2026-03-08 Added _check_termination(): Scientist accept with existing protocol OR max_rounds. Lab Manager accept does NOT auto-terminate. Episode terminates on agreement or round limit Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
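The centralized rule above (Scientist accept with an existing protocol, or hitting the round cap; Lab Manager accept never auto-terminates) can be sketched as a pure function. Parameter names are illustrative, not the env's actual signature:

```python
def check_termination(role: str, action_type: str, has_protocol: bool,
                      round_index: int, max_rounds: int) -> bool:
    """Terminate on Scientist accept (with a protocol on the table) or round cap.

    A Lab Manager 'accept' deliberately falls through to the round-cap check.
    """
    if role == "scientist" and action_type == "accept" and has_protocol:
        return True
    return round_index >= max_rounds
```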
ENV 06 E06 Person A Wire real judge scoring replicalab/env/replicalab_env.py, tests/test_env.py 2026-03-07 Terminal accept steps call build_reward_breakdown() and compute_total_reward() with real rigor/feasibility/fidelity scores stored in EpisodeState. Terminal-without-agreement path now distinguishes timeout (max rounds) from no_agreement verdict. Four new tests in TestEnvReward verify agreement-terminal breakdown/notes/verdict, no-agreement determinism, timeout verdict, and state-stored component scores. Final step returns total reward, breakdown info, and deterministic penalties or bonuses; verdict distinguishes timeout from no_agreement Yes - verified with python -m pytest tests/test_env.py (36 tests pass) and python -m pytest (178 tests pass)
ENV 07 E06 Person A Implement state() deep snapshot replicalab/env/replicalab_env.py 2026-03-08 state() now returns self._state.model_copy(deep=True) so callers get an independent snapshot. Two tests verify mutation isolation. State snapshot is independent of env internals Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
ENV 08 E06 Person A Implement close() with lifecycle guard replicalab/env/replicalab_env.py 2026-03-08 Added _closed flag, idempotent close(), _ensure_open() guard on step(), and reset() reopens a closed env. Three tests verify idempotency, step-after-close raises, and reset-reopens. Close frees resources and does not throw; step after close raises Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
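The close-lifecycle contract in this entry (idempotent `close()`, `step()` raising after close, `reset()` reopening) is easy to show with a `_closed` flag. A toy stand-in, not the real `ReplicaLabEnv`:

```python
class LifecycleGuardedEnv:
    """Toy env demonstrating the _closed flag contract described above."""

    def __init__(self) -> None:
        self._closed = False
        self._round = 0

    def _ensure_open(self) -> None:
        if self._closed:
            raise RuntimeError("env is closed; call reset() to reopen")

    def reset(self) -> dict:
        self._closed = False  # reset() reopens a closed env
        self._round = 0
        return {"round": self._round}

    def step(self, action: dict) -> dict:
        self._ensure_open()  # guard: step after close must raise
        self._round += 1
        return {"round": self._round}

    def close(self) -> None:
        self._closed = True  # idempotent: safe to call repeatedly
```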
ENV 10 E06 Person A Add reset, step, invalid action, timeout, and deterministic replay tests tests/test_env.py 2026-03-08 Added a dedicated replay-determinism regression block that verifies same seed plus same actions yields the same initial observation, step trajectory, timeout terminal path, invalid-action behavior, and audit payload across math, ML, and finance families. The new coverage keeps replay deterministic without depending on file-backed logging. Tests pass for seeded reset, valid step, invalid step, timeout, and replay consistency across supported scenario families Yes - verified with python -m pytest tests/test_env.py (56 tests pass) and full suite (327 passed)
ENV 11 E06 Person A Attach judge audit payload to final StepResult, terminal observations, and replay state replicalab/models.py, replicalab/env/replicalab_env.py, server/app.py, tests/test_env.py, tests/test_server.py 2026-03-08 Added top_failure_reasons to StepInfo, EpisodeState, and EpisodeLog; terminal env steps now build a canonical audit via build_judge_audit(...); and replay log construction now persists top_failure_reasons from terminal StepResult.info instead of dropping them. Seven env tests cover terminal audit behavior and a replay test verifies the audit reasons survive into GET /replay/{episode_id} payloads. Completed episodes expose audit notes alongside reward breakdown in a stable schema across env state and replay Yes - verified with python -m pytest tests/test_env.py (43 tests pass), python -m pytest tests/test_server.py (37 tests pass), and full suite (314 passed)
OBS 04 E10 Person A Add deterministic replay test using seed and action sequence tests/test_env.py 2026-03-08 Closed the observability-side replay guard by reusing the new seeded replay-determinism suite in TestReplayDeterminism, which verifies same-seed same-action trajectories, timeout replay determinism, invalid-action replay determinism, and stable terminal audit payloads across all three scenario families. Replay of the same seed and action sequence matches the prior state sequence deterministically Yes - verified with python -m pytest tests/test_env.py (56 tests pass) and full suite (327 passed)
TST 01 E11 Person A Add reset returns valid observations test tests/test_env.py 2026-03-08 Eight tests in TestReset class covering both roles populated, scientist fields, lab manager fields, booked/out-of-stock preservation, state round zero, episode ID, clearing previous episode, and all templates/difficulties. Test confirms both roles receive valid structured observations Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
TST 02 E11 Person A Add valid action step test tests/test_env.py 2026-03-08 Eight tests in TestStep class covering round advancement, observation shape, conversation history, accept termination, real reward scores, max round termination, step info fields, and full propose-then-accept episode. Valid action advances round and returns correct shape Yes - verified with python -m pytest tests/test_env.py (32 tests pass)
| TST 03 | E11 | Person A | Add invalid action handling test | tests/test_env.py | 2026-03-08 | Four tests in TestInvalidAction class covering error string on invalid duration, env survival after error, no round advancement on invalid action, and request_info always passes. | Invalid action yields structured error and env survives | Yes - verified with python -m pytest tests/test_env.py (32 tests pass) |
| TST 04 | E11 | Person A | Add perfect protocol high reward test | tests/test_reward.py | 2026-03-08 | Added reward-regression coverage proving a fully aligned protocol scores higher than a broken baseline and stays ordered consistently across reruns. | Perfect protocol scores higher than baseline and broken protocol | Yes - verified with python -m pytest tests/test_reward.py (26 tests pass) |
| TST 05 | E11 | Person A | Add zero dimension or penalty behavior test | tests/test_reward.py | 2026-03-08 | Added reward-regression coverage for zero-feasibility collapse, exact penalty subtraction, and zero-floor clamp behavior so timeout and penalty paths lower reward deterministically. | Zero feasibility or timeout lowers reward as expected | Yes - verified with python -m pytest tests/test_reward.py (26 tests pass) |
| MOD 08 | E02 | Person A | Write unit tests for schemas and validators | tests/test_mod08_schemas.py | 2026-03-08 | Created 70 comprehensive unit tests covering all Pydantic model edge cases: ScientistAction (15 tests for all action types, mixed-mode rejection, whitespace stripping, empty/negative field rejection), LabManagerAction (11 tests for all action types, feasible-flag consistency, suggestion-field rules), Protocol (10 tests for boundary values, stripping, extra-field rejection), ConversationEntry (7 tests for null/empty action_type, role validation), RewardBreakdown (9 tests for boundary values, range rejection), Observation (4 tests for both-none, single-role), LabManagerObservation (3 tests for negative fields, stripping), StepInfo (3 tests for extra-field allowance), StepResult (3 tests), EpisodeState (2 tests), EpisodeLog (3 tests for failure reasons, model_dump keys). | Tests cover valid parse, invalid parse, and replay serialization | Yes - verified with python -m pytest tests/test_mod08_schemas.py -v (70 passed) and full suite (409 passed) |
| API 03 | E07 | Person C | Add POST /step endpoint | server/app.py, tests/test_server.py | 2026-03-07 | Fixed _build_episode_log() to take the real StepResult instead of rebuilding reward data from state with stale stub values. Both the REST /step and WebSocket step handlers now pass the terminal StepResult to the updated helper so replay logs use real reward_breakdown, judge_notes, and verdict (including timeout vs no_agreement). Added five endpoint tests covering reset-then-step happy path, invalid session ID 404, terminal step with real reward breakdown, semantic invalid action returning 200 with info.error, and replay with real judge data. | Step endpoint accepts valid action and returns step result | Yes - verified with python -m pytest tests/test_server.py (10 tests pass) and python -m pytest (183 tests pass) |
| API 06 | E07 | Person C | Add WebSocket session handler with isolated env per connection | server/app.py, tests/test_server.py | 2026-03-07 | WebSocket handler at /ws supports reset, step, and ping message types with per-connection env isolation, idle timeout, and replay storage on terminal episodes. Twelve WebSocket tests cover ping-pong, reset observation, step result, full-episode real reward, invalid JSON, missing action field, invalid action payload, unknown message type, session isolation, semantic invalid action returning step_ok with info.error, timeout verdict proving real-env integration, and terminal episode replay persistence via GET /replay/{episode_id}. | WebSocket session handler supports reset, step, ping with isolated env per connection and correct replay storage | Yes - verified with python -m pytest tests/test_server.py (22 tests pass) and python -m pytest (195 tests pass) |
| TST 07 | E11 | Person C | Add WebSocket session handler tests | tests/test_server.py | 2026-03-07 | Twelve focused WebSocket tests covering connectivity, message handling, error paths, session isolation, semantic-vs-transport error distinction, timeout verdict, and replay log persistence with real judge data. Tests verify that structurally valid but semantically invalid actions return step_ok with info.error (not WS error frames), matching the env contract. | WebSocket tests cover happy path, error handling, session isolation, and real-env integration | Yes - verified with python -m pytest tests/test_server.py (22 tests pass) |
| API 02 | E07 | Person C | Add POST /reset endpoint | server/app.py, tests/test_server.py | 2026-03-08 | The /reset endpoint creates a new env (or closes the prior one when reusing session_id), calls env.reset(...), persists env, last_active, and episode_id in the in-memory REST session store, and returns session_id, episode_id, observation. Seven dedicated tests cover response shape, both-role observation, explicit session_id reuse, prior-env close on reuse, default params, all scenario/difficulty combos, and seed determinism. | Reset endpoint starts a new episode and returns initial observation | Yes - verified with python -m pytest tests/test_server.py (29 tests pass) and python -m pytest (202 tests pass) |
| API 04 | E07 | Person C | Add GET /scenarios endpoint | server/app.py, tests/test_server.py | 2026-03-08 | GET /scenarios returns the available_scenario_families() output through the typed ScenariosResponse model. Five focused tests cover status code, response shape, all three scenario families, the expected easy, medium, and hard difficulties, and the absence of extra keys. | Endpoint lists available scenario families and difficulties | Yes - verified with python -m pytest tests/test_server.py -v (34 tests pass) |
| API 07 | E07 | Person C | Add idle timeout and graceful disconnect cleanup | server/app.py, tests/test_server.py | 2026-03-08 | Verified the existing WebSocket idle-timeout and disconnect cleanup path with two focused tests: one monkeypatches the idle timeout to 0.5s and confirms the server closes with code 1000 when no message arrives, and one wraps _make_env() to confirm env.close() is called exactly once from the finally block on disconnect. | Stale connections close cleanly and the environment closes without leak | Yes - verified with python -m pytest tests/test_server.py -v (34 tests pass) |
| API 13 | E07 | Person C | Add CORS middleware configuration for frontend origins in dev and production | server/app.py, tests/test_server.py | 2026-03-08 | Confirmed the existing FastAPI CORS middleware allows the local Vite frontend origin plus https://*.hf.space, and added three explicit preflight tests covering localhost allowance, HF Space allowance, and disallowed-origin rejection. | Frontend on localhost:5173 and HF Space origin can reach the API without CORS errors | Yes - verified with python -m pytest tests/test_server.py -v (34 tests pass) |
| API 08 | E07 | Person C | Build Dockerfile with Python app startup on port 7860 | server/Dockerfile, Dockerfile, server/requirements.txt, docs/max/deployment.md | 2026-03-08 | Fixed the editable install (-e . → . --no-deps) in both server/Dockerfile and the root Dockerfile, added httpx and websocket-client to server/requirements.txt (required by replicalab.client), and rebuilt without cache. Verified the Docker container starts with the real env ("env":"real") and all four endpoints work: GET /health, GET /scenarios, POST /reset, POST /step. Added verified endpoint commands to docs/max/deployment.md. | Local Docker run serves app on port 7860 | Yes - verified with docker build -f server/Dockerfile -t replicalab . && docker run -p 7860:7860 replicalab and curl against all four endpoints |
| API 09 | E07 | Person C | Add Hugging Face Space metadata and deploy instructions | README.md, Dockerfile, docs/max/deployment.md | 2026-03-08 | Added the Hugging Face Spaces YAML frontmatter to the root README, created the root-level Dockerfile required by the Docker SDK, and documented Space creation, git remote setup, push, logs, and secret management in docs/max/deployment.md. | Space config is valid for Docker app deployment | Yes - verified against HF Spaces Docker deployment requirements |
| API 15 | E07 | Person C | Create HF Space README.md with YAML frontmatter | README.md | 2026-03-08 | Added the required Spaces frontmatter fields (sdk: docker, app_port: 7860, title, emoji, colors, pinned) to the root README so Hugging Face parses the Space metadata correctly on push. | HF Space config is valid and Space launches correctly from the metadata | Yes - verified against the HF Spaces frontmatter schema |
| API 14 | E07 | Person C | Add REST session management so each user gets isolated environment state | tests/test_api_rest_isolation.py | 2026-03-08 | Created 11 dedicated REST session isolation tests in a standalone file covering: two resets produce different sessions, independent observations across scenarios, stepping one session does not mutate the other, independent round counts, terminal isolation, session_id reuse creates a new episode and resets rounds, reuse does not affect other sessions, 404 on nonexistent session, step-after-terminal behavior, and replay isolation between sessions. No server changes needed; isolation already works correctly. | Two concurrent REST users do not share or corrupt each other's episode state | Yes - verified with python -m pytest tests/test_api_rest_isolation.py (11 tests pass) and full suite (307 passed) |
| API 10 | E07 | Person C | Deploy live Space and verify health, reset, and step | docs/max/deployment.md, README.md | 2026-03-08 | Verified the live HF Space at https://ayushozha-replicalab.hf.space with all four endpoints: GET /health (200, env=real), GET /scenarios (200, 3 families), POST /reset (200, returns session_id/episode_id/observation), POST /step (200, returns reward/done/info). Ran a full episode (propose → accept) with real judge scoring: rigor=0.465, feasibility=1.000, fidelity=0.325, total_reward=2.313, verdict=accept. Updated deployment docs and README with the verified live URL. | Live Space responds successfully and one end-to-end episode works on the hosted env | Yes - verified with httpx requests against https://ayushozha-replicalab.hf.space |
| API 17 | E07 | Person C | Document secrets and API key management for hosted deployment and Colab | docs/max/deployment.md | 2026-03-08 | Documented that the server is fully self-contained with no external API calls or secrets required. Added a secrets reference table for all four contexts (HF Space, local dev, Docker, Colab notebook) with HF_TOKEN for model downloads and REPLICALAB_URL for the hosted env. Documented Colab Secrets panel setup. Added a future-secrets section for an optional hosted evaluator. | Secrets setup is documented clearly enough for another teammate to reproduce | Yes - verified by inspecting server/app.py for env var references (none found) and documenting the complete secrets landscape |
| JDG 07 | E05 | Person C | Log reward breakdown to CSV or JSONL per episode | replicalab/utils/logging.py, tests/test_logging.py | 2026-03-08 | Verified the existing implementation: append_reward_csv() writes per-episode rows with all V2 columns (parsimony, bonuses, penalty total, verdict), append_reward_jsonl() preserves nested penalty dicts and bounded-tool metrics, and log_episode_reward() writes to both formats. 22 tests in tests/test_logging.py cover CSV creation, header dedup, JSONL records, default breakdowns, nested penalty preservation, determinism, and the convenience wrapper. No code changes needed. | Reward file contains seed, scenario, score components, total reward, rounds, agreement, and bounded tool metrics | Yes - verified with python -m pytest tests/test_logging.py -v (22 passed) and full suite (409 passed) |
| API 01 | E07 | Person C | Create FastAPI app shell and health endpoint | server/app.py | 2026-03-08 | Verified the FastAPI app shell is fully functional: GET /health returns 200 with {"status":"ok","env":"real"}, the app imports and wires ReplicaLabEnv, logging is configured via env vars, CORS middleware is active, and all downstream endpoints (reset, step, scenarios, replay, WebSocket) are operational. Server endpoint tests in tests/test_server.py (34 tests) and REST isolation tests (11 tests) confirm full coverage. No code changes needed; the task was already complete beyond its acceptance criteria. | GET /health returns 200 with simple payload | Yes - verified with existing tests and full suite (409 passed) |
| OBS 02 | E10 | Person C | Add local log levels and readable console formatting | replicalab/config.py, server/app.py | 2026-03-08 | Verified logging already meets all acceptance criteria: the REPLICALAB_LOG_LEVEL env var toggles log verbosity without code edits (default INFO), LOG_FORMAT provides a readable %(asctime)s [%(levelname)s] %(name)s: %(message)s layout, and server/app.py wires both via logging.basicConfig(). No code changes needed. | Debug logs can be toggled without code edits | Yes - verified by reading replicalab/config.py (lines 30-31) and server/app.py (lines 75-79) |
| ENV 09 | E06 | Person C | Write episode logs on completion | server/app.py | 2026-03-08 | Added write_episode_log() and log_episode_reward() calls to server/app.py in both the REST /step and WebSocket step handlers. Terminal episodes now auto-persist replay JSON and reward CSV/JSONL to disk. | Completed episodes generate replayable logs automatically | Yes - verified with terminal episode persistence through REST and WebSocket paths |
| OBS 09 | E10 | Person C | Extend episode summary with audit metadata | replicalab/models.py | 2026-03-08 | Added invalid_action_count (int) and invalid_action_rate (float) fields to EpisodeLog in replicalab/models.py. The server tracks invalid actions per session and per WebSocket connection. | Every completed episode log contains the audit payload plus demo and evaluation metrics | Yes - verified with model field presence and server-side tracking |
| OBS 07 | E10 | Person C | Script to run one episode and dump logs | scripts/run_episode.py | 2026-03-08 | Created scripts/run_episode.py that resets the env, runs a baseline propose-then-accept episode, and writes replay JSON plus reward CSV/JSONL. | One command produces a complete local sample log | Yes - verified with script execution producing replay and reward files |
| TST 11 | E11 | Person C | Judge audit payload contract tests | tests/test_audit_contract.py | 2026-03-08 | Created tests/test_audit_contract.py with 17 tests across 3 classes: StepInfoAuditContract (6 tests), EpisodeLogAuditContract (6 tests), AuditModelContracts (5 tests). | Tests confirm terminal payloads and replay files expose audit notes, agreement, and invalid action metrics | Yes - verified with python -m pytest tests/test_audit_contract.py (17 tests pass) |
| API 05 | E07 | Person C | Add GET /replay/{episode_id} endpoint | server/app.py | 2026-03-08 | Already implemented at server/app.py lines 536-540. The endpoint returns the completed episode log JSON for a valid episode ID. | Endpoint returns completed log for valid episode id | Yes - verified with existing replay endpoint tests |
| API 11 | E07 | Person C | Add server endpoint tests and WebSocket smoke test | tests/test_server.py | 2026-03-08 | Already implemented in tests/test_server.py with 44 tests covering health, reset, step, scenarios, replay, WebSocket connectivity, error handling, session isolation, and smoke paths. | Local server tests pass for health, reset, step, invalid payload, and ws connect | Yes - verified with python -m pytest tests/test_server.py (44 tests pass) |
| API 18 | E07 | Person C | Include judge audit payload in terminal responses | server/app.py | 2026-03-08 | Already implemented. Terminal StepInfo includes judge_notes, verdict, and top_failure_reasons from the real judge audit in both REST and WebSocket paths. | Clients receive judge_notes, verdict fields, and bounded tool audit data without separate log file access | Yes - verified with terminal response inspection and audit contract tests |
| OBS 01 | E10 | Person C | Standardize episode log schema | replicalab/models.py | 2026-03-08 | Already implemented. The EpisodeLog model in replicalab/models.py is the canonical schema with all required fields for transcript, state snapshots, scores, and audit metadata. | Every completed episode log contains the same required fields | Yes - verified with EpisodeLog model inspection and schema tests |
| OBS 03 | E10 | Person C | Episode id generation and file naming conventions | replicalab/utils/logging.py | 2026-03-08 | Already implemented. UUID generation in the env, {episode_id}.json naming in replicalab/utils/logging.py. Logs never overwrite because each episode gets a unique UUID. | Logs never overwrite and are easy to locate | Yes - verified with replay file naming behavior |
| TST 06 | E11 | Person C | Health plus reset plus step endpoint tests | tests/test_server.py | 2026-03-08 | Already implemented in tests/test_server.py with TestHealthEndpoint, TestResetEndpoint, and TestStepEndpoint classes. | API tests pass locally | Yes - verified with python -m pytest tests/test_server.py |
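The API 02 and API 14 rows above describe the REST session model: each session_id owns an isolated environment, reusing a session_id closes the prior env, and stepping one session never touches another. A minimal self-contained sketch of that pattern, using a stub env and invented names (the real logic lives in server/app.py and ReplicaLabEnv; nothing here is the actual implementation):

```python
import uuid

class StubEnv:
    """Stand-in for ReplicaLabEnv, reduced to the lifecycle the server needs."""
    def __init__(self):
        self.closed = False
        self.rounds = 0
    def reset(self):
        self.rounds = 0
        return {"observation": "initial"}
    def step(self, action):
        self.rounds += 1
        return {"rounds": self.rounds}
    def close(self):
        self.closed = True

class SessionStore:
    """In-memory REST session store: one isolated env per session_id."""
    def __init__(self):
        self._sessions = {}
    def reset(self, session_id=None):
        # Reusing a session_id closes the prior env before creating a new one.
        if session_id in self._sessions:
            self._sessions[session_id].close()
        session_id = session_id or uuid.uuid4().hex
        env = StubEnv()
        obs = env.reset()
        self._sessions[session_id] = env
        return session_id, obs
    def step(self, session_id, action):
        if session_id not in self._sessions:
            raise KeyError(session_id)  # the real server maps this to a 404
        return self._sessions[session_id].step(action)

store = SessionStore()
a, _ = store.reset()
b, _ = store.reset()
store.step(a, {"action_type": "propose"})  # advances session a only
```

Stepping session `a` leaves session `b` at zero rounds, which is exactly the isolation property the TST tests in tests/test_api_rest_isolation.py assert against the real server.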

Person B (Ayush) - Completed own tasks

ID Epic Task File/Module Date What Was Done Acceptance Criteria Verified
MOD 09 E02 Add output parser that maps model text to ScientistAction replicalab/agents/scientist_policy.py, replicalab/agents/__init__.py, tests/test_scientist_policy.py 2026-03-08 Added a raw-text parser that extracts JSON from plain output, fenced blocks, or prose-wrapped objects, validates it into ScientistAction, and raises explicit ScientistOutputParseError values for missing JSON, invalid JSON, or schema failures. Parser returns structured action or explicit parse error Yes - verified with python -m pytest tests/test_scientist_policy.py tests/test_models.py and a direct parse_scientist_output(...) smoke check
SCN 11 E03 Create hand checked golden scenarios for prompt testing tests/fixtures/golden_scenarios.json, tests/test_scenarios.py 2026-03-08 Added three deterministic golden scenarios for math, ML, and finance prompt checks plus fixture-validation tests. Three fixed scenarios are available for deterministic manual testing Yes - verified with python -m pytest tests/test_scenarios.py
AGT 01 E04 Draft domain-neutral system prompt for Scientist role from normalized scenario data replicalab/agents/scientist_policy.py, tests/test_scientist_policy.py 2026-03-08 Added build_scientist_system_prompt(...) to render role guidance, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON contract from normalized scenario data. Prompt clearly explains role, mapped constraints, and JSON output contract Yes - verified with python -m pytest tests/test_scientist_policy.py and a direct prompt-build smoke check
AGT 02 E04 Build observation to prompt formatting helper from normalized scenario-derived observations replicalab/agents/scientist_policy.py, replicalab/agents/__init__.py, tests/test_scientist_policy.py 2026-03-08 Added format_scientist_observation(...) to render round status, paper context, conversation history, current protocol, and the next-action instruction in a fixed deterministic order, and exported it through the agent package. Formatted prompt includes task info, history, and action schema consistently Yes - verified with python -m pytest tests/test_scientist_policy.py
AGT 04 E04 Build baseline heuristic Scientist for non trained smoke tests replicalab/agents/scientist_policy.py, replicalab/agents/__init__.py, tests/test_scientist_policy.py 2026-03-08 Added build_baseline_scientist_action(...), a deterministic baseline Scientist policy that proposes a protocol on the first turn, revises only when the latest Lab Manager feedback contains an obvious blocker, and otherwise accepts the current protocol so smoke episodes can finish cleanly. Baseline can complete episodes without crashing Yes - verified with python -m pytest tests/test_scientist_policy.py including a stub-env episode smoke test
AGT 05 E04 Implement deterministic feasibility checker over normalized constraints and resources replicalab/agents/lab_manager_policy.py, replicalab/agents/__init__.py, tests/test_lab_manager_policy.py 2026-03-08 Added a deterministic Lab Manager feasibility checker with a typed FeasibilityCheckResult, explicit per-dimension protocol, budget, equipment, reagents, schedule, staff, and policy checks, substitution reporting, and stable summary output. Checker returns clear pass or fail per constraint dimension Yes - verified with python -m pytest tests/test_lab_manager_policy.py tests/test_validation.py tests/test_scientist_policy.py
AGT 06 E04 Implement alternative suggestion logic from allowed substitutions and resource tradeoffs replicalab/agents/lab_manager_policy.py, replicalab/agents/__init__.py, tests/test_lab_manager_policy.py 2026-03-08 Added deterministic alternative-suggestion logic that applies substitutions, duration clamps, and sample-size reductions in fixed order, re-runs feasibility after the revision, and returns a typed AlternativeSuggestion with applied changes, remaining failures, and pre or post feasibility checks. Lab Manager can suggest at least one sensible revision when the initial plan fails Yes - verified with python -m pytest tests/test_lab_manager_policy.py
AGT 07 E04 Add grounded Lab Manager response synthesis from feasibility results and suggested revisions replicalab/agents/lab_manager_policy.py, replicalab/agents/__init__.py, server/app.py, tests/test_lab_manager_policy.py 2026-03-08 Added compose_lab_manager_response(...), a deterministic outward-action composer that converts feasibility plus alternative-suggestion results into a typed LabManagerAction with stable flags, readable explanations, and optional injected explanation rendering, then wired the stub server to log those grounded responses instead of placeholder text. Output is readable, grounded in checker results, and maps cleanly to underlying checks Yes - verified with python -m pytest tests/test_lab_manager_policy.py and a stub-env step smoke check
AGT 11 E04 Select and document base model for Scientist training docs/agt11_scientist_model_selection.md, README.md, docs/training_goals.md 2026-03-08 Updated the active model decision to use Qwen/Qwen3.5-9B as the shared Scientist and Lab Manager base for Northflank H100 runs, with Qwen/Qwen3.5-4B as fallback and Qwen/Qwen3.5-122B-A10B documented as an audit-only judge candidate. Decision is recorded and all team members know which model family is being fine tuned Yes - verified by the decision record, training-goals doc, and README update
AGT 10 E04 Write prompt text files for all three roles with bounded tool rules replicalab/prompts/__init__.py, replicalab/prompts/scientist.txt, replicalab/prompts/lab_manager.txt, replicalab/prompts/judge.txt, tests/test_prompts.py 2026-03-08 Added loadable prompt templates and rendering helpers for Scientist, Lab Manager, and Judge, each grounded in normalized scenario data and explicit bounded-tool rules for search_evidence, run_code_check, and inspect_image. Six prompt tests verify loadability, placeholder rendering, domain neutrality, and role-specific bounded-tool guidance. Prompt files exist, are loadable, encode bounded tool rules clearly, and assemble correctly from normalized scenario data and agreed role behavior Yes - verified with python -m pytest tests/test_prompts.py (6 tests pass) and full suite (304 passed)
AGT 03 E04 Add parse plus retry strategy for malformed model output replicalab/agents/scientist_policy.py, tests/test_scientist_policy.py 2026-03-07 Added call_scientist_with_retry(...) with error-specific correction prompts, bounded retry loop, and exposed RetryMetadata telemetry. Seven focused tests cover first-try success, malformed-then-valid, invalid-then-valid, exhaustion, correction message content, and metadata serialization. Malformed output triggers at least one controlled retry or explicit failure Yes - verified with python -m pytest tests/test_scientist_policy.py (7 retry tests pass)
AGT 08 E04 Add prompt formatting, parse, and bounded-tool policy tests for Scientist policy replicalab/agents/scientist_policy.py, tests/test_scientist_policy.py 2026-03-07 Added bounded-tool policy block to build_scientist_system_prompt(...) naming search_evidence, run_code_check, and inspect_image with explicit rules. Added 24 new tests covering parser happy paths (propose, accept, prose-wrapped), parser edge cases (empty, whitespace, list, extra keys, to_dict()), system prompt across all 3 domains plus dict coercion, bounded-tool policy assertions across all domains, role-boundary and output-contract assertions, formatter edge cases (final round, empty-list protocol), and baseline domain inference and forced-accept behavior. Tests cover happy path, malformed output handling, and stable tool-policy reminders Yes - verified with python -m pytest tests/test_scientist_policy.py (46 tests pass) and python -m pytest tests/ (111 tests pass)
TRN 13 E08 Create reusable environment client module replicalab/client.py, tests/test_client.py 2026-03-08 Added ReplicaLabClient with dual transport support (REST via httpx, WebSocket via websocket-client), unified sync interface (connect, reset, step, state, close), context manager support, internal session ID tracking, typed returns mapped to Pydantic models, and constructor-level transport selection. Twenty-four tests cover both transports: connect, reset, step, full episode, replay, context manager, error paths, semantic invalid action handling, and constructor validation. Client module can be imported by notebook and other consumers without duplicating connection logic Yes - verified with python -m pytest tests/test_client.py (24 tests pass) and python -m pytest (231 tests pass)
TRN 03 E08 Implement env client wrapper for training rollouts replicalab/training/rollout.py, replicalab/training/__init__.py, tests/test_rollout.py 2026-03-08 Added RolloutWorker that wraps ReplicaLabClient to run full episodes via a user-supplied PolicyFn callback, collects typed StepRecord trajectories with observations, actions, and errors, and surfaces terminal EpisodeRecord with total_reward, reward_breakdown, judge_notes, verdict, and agreement_reached. Twelve tests cover baseline rollout completion, reward/breakdown/judge output, determinism, all 3 scenario families, metadata capture, max_steps safety cap, and validation error surfacing. One local episode can be run start-to-finish through the wrapper with no duplicated HTTP/WS code Yes - verified with python -m pytest tests/test_rollout.py (12 tests pass) and python -m pytest (264 tests pass)
TRN 04 E08 Implement rollout collection loop for Scientist episodes replicalab/training/rollout.py, replicalab/training/__init__.py, tests/test_rollout.py, tests/test_rollout_traces.py 2026-03-08 Extended the rollout worker to collect full trajectory records with terminal StepInfo, bounded tool traces, and batched rollout support via collect_rollouts(...). Added trace-focused tests that verify tool-trace capture from StepInfo extras and one-record-per-seed batch collection. Loop collects trajectories, rewards, done signals, and bounded tool traces from frozen evidence packs Yes - verified with python -m pytest tests/test_rollout.py tests/test_rollout_traces.py (14 tests pass) and full suite (304 passed)
TRN 01 E08 Create notebook skeleton notebooks/train_colab.ipynb 2026-03-08 Added a judged-path training notebook with explicit setup, evidence preview, Scientist plan preview, Lab Manager plan preview, gated real-training cell, baseline evaluation cell, and Northflank runtime notes so the flow is readable without hiding logic in notebook-only cells. Notebook has clear runnable sections in the right order and documents the bounded-tool policy Yes - verified with notebook JSON load, preview-plan execution, and python -m pytest tests/test_training_cli.py
TRN 02 E08 Add package install and model setup cell notebooks/train_colab.ipynb, replicalab/training/runtime.py, pyproject.toml 2026-03-08 Added a fresh-runtime install cell that installs the repo plus unsloth, unsloth_zoo, trl, vllm, datasets, and matplotlib, then added runtime helpers and the replicalab-train entrypoint so the same model-loading path works in notebooks and Northflank jobs. Notebook installs dependencies without manual edits beyond secrets Yes - verified with notebook inspection and python -m pytest tests/test_training_cli.py
TRN 14 E08 Select and document base model (notebook side) docs/agt11_scientist_model_selection.md, README.md, notebooks/train_colab.ipynb, docs/training_goals.md 2026-03-08 Updated the active model decision to Qwen/Qwen3.5-9B as the primary shared base for Scientist GRPO and Lab Manager SFT on Northflank H100, kept Qwen/Qwen3.5-4B as the reduced-scale fallback, and documented Qwen/Qwen3.5-122B-A10B as an audit-only judge candidate. Base model choice is documented and all team members know which model family is being trained Yes - verified by the decision record and README update; notebook defaults remain the smaller sponsor-facing path where appropriate
JDG 10 E05 Expose component metrics for training plots replicalab/training/metrics.py, replicalab/training/plots.py, replicalab/training/cli.py, tests/test_training_metrics.py, docs/training_goals.md 2026-03-08 Extended the evaluation and metrics layer to expose average rigor, feasibility, fidelity, parsimony, tool-trace volume, invalid bounded-tool rate, paper understanding, and communication quality, then wired those metrics into saved before-vs-after plots plus shared cross-run benchmark history plots. Notebook and CLI can read the core quality metrics over time, including paper understanding and communication Yes - verified with python -m pytest tests/test_training_metrics.py tests/test_training_cli.py and generated plot artifacts
TRN 05 E08 Connect rollouts to GRPO or equivalent trainer replicalab/training/art_openenv.py, replicalab/training/cli.py, tests/test_training_cli.py, replicalab/outputs/art-training/ 2026-03-08 Added the ART/OpenEnv Scientist training path, converting live ReplicaLab episodes plus frozen evidence packs into ART trajectory groups and executing successful live training updates against the hosted environment. At least one short training run completes without runtime errors while preserving deterministic reward and frozen evidence inputs Yes - verified with live art-scientist-train runs including art-scientist-smoke-20260308 and art-scientist-live-20260308-main
TRN 06 E08 Log episode reward, rigor, feasibility, fidelity, rounds used, and bounded tool metrics replicalab/training/metrics.py, replicalab/training/art_openenv.py, replicalab/training/cli.py 2026-03-08 Added structured episode metric exports covering reward, component scores, rounds used, agreement, parse errors, invalid actions, and invalid bounded-tool rates to JSONL and summary artifacts. Notebook stores a metrics frame across training episodes including bounded tool metrics Yes - verified with reports/metrics.jsonl outputs from ART training and comparison runs
TRN 07 E08 Plot reward curve and component curves with matplotlib replicalab/training/plots.py, replicalab/training/cli.py, replicalab/outputs/art-training/ 2026-03-08 Added saved matplotlib plotting for training-history curves, per-step ART reward-component plots, and comparison bar charts for reward, agreement, invalid actions, and invalid bounded-tool rate. Plotted image shows visible metrics and can be saved to file Yes - verified with saved images including art_reward_components.png and the compare_*.png outputs
TRN 08 E08 Add before versus after evaluation on fixed seeds and frozen evidence packs replicalab/training/evaluation.py, replicalab/training/cli.py, replicalab/agents/scientist_policy.py 2026-03-08 Added policy-comparison evaluation on fixed seeds and frozen evidence packs, then exercised it against the deterministic baseline and trained ART Scientist checkpoints. Notebook compares baseline and trained policy on the same scenarios and evidence packs Yes - verified with scientist-compare-eval runs including art-scientist-compare-smoke-20260308 and art-scientist-compare-20260308-step5
| TRN 09 | E08 | Add policy loading path for trained adapter or checkpoint | replicalab/agents/scientist_policy.py, replicalab/agents/__init__.py, tests/test_scientist_policy.py | 2026-03-08 | Added remote trained-policy loading for ART checkpoints, including evidence-pack-aware prompt assembly and parser-driven retry, so evaluation can switch cleanly between baseline and trained Scientist policies. | Evaluation can switch between baseline and trained model cleanly | Yes - verified with live scientist-compare-eval runs against explicit ART checkpoint steps |
| TRN 10 | E08 | Export plot image and sample logs to outputs/plots | replicalab/training/cli.py, replicalab/outputs/art-training/, replicalab/outputs/training/ | 2026-03-08 | Wired the CLI to save training plots, comparison plots, metrics JSONL, summaries, manifests, and run metadata into stable output directories for README and demo reuse. | Plots are saved and versioned for README use | Yes - verified with generated plot and report artifacts under replicalab/outputs/art-training/ and replicalab/outputs/training/ |
| TRN 15 | E08 | Add agreement rate, invalid action rate, and invalid bounded-tool rate aggregation to evaluation outputs | replicalab/training/metrics.py, replicalab/training/evaluation.py, replicalab/training/cli.py, tests/test_training_metrics.py | 2026-03-08 | Added aggregate agreement, invalid-action, and invalid bounded-tool metrics across evaluation cases, surfaced them in summaries, and plotted them for before-vs-after comparisons. | Notebook reports reward, rounds, agreement rate, invalid action rate, and invalid bounded-tool rate for baseline and trained runs | Yes - verified with comparison summaries and plots from the ART evaluation runs |
| OBS 06 | E10 | Log training run metadata including model, seed, scenario set, steps, evidence-pack version, and bounded-tool policy | replicalab/training/cli.py, replicalab/outputs/art-training/*/reports/run_metadata.json | 2026-03-08 | Added reproducibility metadata exports for every training and evaluation command, including base model, scenario set, checkpoint step, evidence-pack version, and bounded-tool policy. | Notebook exports metadata with each run for reproducibility, including evidence-pack version and bounded-tool policy | Yes - verified with generated run_metadata.json files in training and comparison smoke runs |
| TST 09 | E11 | Create notebook smoke test for fresh runtime | docs/ayush/notebook_smoke_test.md, replicalab/outputs/training/, replicalab/outputs/art-training/ | 2026-03-08 | Wrote the fresh-runtime smoke checklist, then executed the preview, live ART training, and comparison-eval commands end to end against frozen evidence packs and the hosted ReplicaLab environment. | Training notebook runs from top with minimal edits and the bounded-tool path works against frozen evidence packs | Yes - verified with scientist-preview-smoke-20260308b, lab-manager-preview-smoke-20260308b, art-scientist-smoke-20260308b, and art-scientist-compare-smoke-20260308b |
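The TRN 15 row above describes aggregating agreement and invalid-rate metrics across evaluation cases. A minimal sketch of that kind of aggregation is shown below; the record fields (`agreed`, `invalid_actions`, `invalid_bounded_tool`, `total_actions`) and the function name are illustrative assumptions, not the actual `replicalab/training/metrics.py` API.

```python
# Illustrative sketch only: field and function names are assumptions,
# not the actual replicalab/training/metrics.py interface.
from dataclasses import dataclass


@dataclass
class CaseResult:
    agreed: bool               # Scientist and Lab Manager reached agreement
    invalid_actions: int       # actions rejected by the schema validator
    invalid_bounded_tool: int  # bounded-tool calls outside the allowed policy
    total_actions: int         # all actions taken in this evaluation case


def aggregate(cases: list[CaseResult]) -> dict[str, float]:
    """Aggregate agreement and invalid-rate metrics across evaluation cases."""
    total_actions = sum(c.total_actions for c in cases) or 1
    return {
        "agreement_rate": sum(c.agreed for c in cases) / len(cases),
        "invalid_action_rate": sum(c.invalid_actions for c in cases) / total_actions,
        "invalid_bounded_tool_rate": sum(c.invalid_bounded_tool for c in cases) / total_actions,
    }


cases = [
    CaseResult(agreed=True, invalid_actions=0, invalid_bounded_tool=0, total_actions=4),
    CaseResult(agreed=False, invalid_actions=1, invalid_bounded_tool=1, total_actions=4),
]
print(aggregate(cases))
# -> {'agreement_rate': 0.5, 'invalid_action_rate': 0.125, 'invalid_bounded_tool_rate': 0.125}
```

Rates of this shape are what the before-vs-after comparison plots referenced in the row would consume: one value per run, computed over the shared set of evaluation cases.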

### Kush (Person D) - Completed on behalf of others

| ID | Epic | Assigned To | Task | File/Module | Date | What Was Done | Acceptance Criteria | Verified |
|---|---|---|---|---|---|---|---|---|
| FND 03 | E01 | Max (Person C) | Initialize React plus Vite frontend shell | frontend/package.json, frontend/src/, frontend/public/ | 2026-03-08 | Imported the full React plus Vite frontend tree from Kush's branch onto ayush, including the app shell, pages, component library, assets, and TypeScript config. | npm install and dev server run successfully | Yes - verified with npm --prefix frontend install and npm --prefix frontend run build |
| FND 12 | E01 | Max (Person C) | Create Vite config with API and WebSocket proxy support plus stable build output settings | frontend/vite.config.ts | 2026-03-08 | Imported Kush's Vite configuration with @ alias plus /api and /ws proxy rules, then verified the frontend builds successfully against that config on ayush. | Frontend dev server can reach backend without manual URL edits and build output is predictable for Docker packaging | Yes - verified with npm --prefix frontend run build |
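The FND 12 row describes a Vite config with an `@` alias plus `/api` and `/ws` proxy rules. The fragment below is an illustrative sketch of that shape only; the real `frontend/vite.config.ts` may differ in targets, ports, and plugin set, and the backend port 8000 is an assumption.

```typescript
// Illustrative sketch of frontend/vite.config.ts; targets and ports are
// assumptions, not the project's actual values.
import { defineConfig } from "vite";
import { fileURLToPath, URL } from "node:url";

export default defineConfig({
  resolve: {
    // "@" alias so imports can be written as "@/components/..."
    alias: { "@": fileURLToPath(new URL("./src", import.meta.url)) },
  },
  server: {
    proxy: {
      // REST requests forwarded to the backend, so no manual URL edits
      "/api": { target: "http://localhost:8000", changeOrigin: true },
      // WebSocket traffic (ws: true enables WebSocket proxying)
      "/ws": { target: "ws://localhost:8000", ws: true },
    },
  },
  build: {
    // stable output directory, which keeps Docker packaging predictable
    outDir: "dist",
  },
});
```

With a config of this shape, the dev server and the Docker build both resolve backend URLs the same way, which is what the acceptance criterion in the row checks.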

### Shared Tasks - Completed

| ID | Epic | Owners | Task | Status |
|---|---|---|---|---|
| FND 08 | E01 | Person A and B | Freeze JSON contract for actions and observations | Completed |

### Max (Person C) - Completed own task

| ID | Epic | Task | Status |
|---|---|---|---|
| FND 11 | E01 | Create server/requirements.txt pinning runtime dependencies | Completed |

### Kush (Person D) - Completed own tasks

| ID | Epic | Task | Status |
|---|---|---|---|
| FND 13 | E01 | Tailwind v4.2 + theme tokens + light/dark mode | Completed |
| UI 01 | E09 | App shell with three-panel layout | Completed |
| UI 02 | E09 | PaperPanel | Completed |
| UI 03 | E09 | ProtocolPanel with DiffRow | Completed |
| UI 04 | E09 | NegotiationLog with character avatars | Completed |
| UI 05 | E09 | ScorePanel with rigor/feasibility/fidelity bars | Completed |
| UI 06 | E09 | Controls (scenario selector, seed input, difficulty) | Completed |
| UI 07 | E09 | REST + WebSocket API client (api.ts) | Completed |
| UI 08 | E09 | ReplayViewer with range slider | Completed |
| UI 09 | E09 | TrainingResults with LineChart | Completed |
| UI 10 | E09 | Styling, animations, 3D lab scene | Completed |
| UI 11 | E09 | Multi-stage Docker, SPA serving | Completed |
| UI 13 | E09 | JudgeAuditPanel with verdict display | Completed |
| UI 14 | E09 | Replay scrubber with skip buttons | Completed |
| UI 15 | E09 | Before vs after training toggle | Completed |
| JDG 09 | E05 | Mock score cards for frontend | Completed |
| OBS 05 | E10 | Episode ID + copy-to-clipboard in UI | Completed |

## What Completing These Tasks Unblocked

| Completed Task | Directly Unblocked |
|---|---|
| FND 01 | FND 02, FND 03, FND 04, FND 05, FND 06, FND 07, FND 10 |
| FND 02 | FND 11 |
| FND 03 | FND 12, FND 13, UI 01 |
| FND 04 | FND 08, FND 09 |
| FND 05 | No downstream dependencies |
| FND 06 | DOC 01 |
| FND 07 | No downstream dependencies |
| FND 08 | MOD 01, MOD 02, MOD 03, MOD 12, SCN 01 |
| FND 09 | OpenEnv registration layer is now present for later /web and deployment work |
| FND 10 | No downstream dependencies |
| FND 11 | No new formal dependencies, but server scaffold work can now install from a standalone requirements file |
| FND 12 | Frontend dev proxying is now configured for local API and WebSocket work |
| MOD 01 | MOD 05, MOD 09 |
| MOD 02 | No new formal dependencies, but the Lab Manager contract is now stable for later policy work |
| MOD 03 | MOD 04, MOD 11 |
| MOD 04 | MOD 07, ENV 01 |
| MOD 05 | MOD 06, AGT 05 |
| MOD 11 | No new formal dependency edge by itself, but StepResult metadata is now stable for environment, API, replay, and training consumers |
| MOD 12 | Shared defaults now come from replicalab/config.py, reducing config drift before environment and scoring work expands |
| SCN 01 | SCN 09 now has a deterministic seed utility to build on |
| SCN 02 | SCN 03, SCN 04, SCN 05, SCN 07 |
| SCN 03 | SCN 06, SCN 08 |
| SCN 04 | SCN 06, SCN 08 |
| SCN 05 | SCN 06, SCN 08 |
| SCN 06 | Harder scenario variants and curriculum-ready difficulty scaling now exist |
| SCN 07 | AGT 05 is complete; AGT 06, AGT 07, and JDG 02 are now unblocked from the normalized resource layer |
| SCN 08 | AGT 06 is now unblocked; JDG 01 and JDG 03 are also unblocked |
| SCN 09 | SCN 10, SCN 11, ENV 01, ENV 02 |
| SCN 10 | Scenario determinism and consistency now have regression coverage |
| SCN 11 | AGT 01, TRN 08 |
| MOD 09 | Together with completed AGT 02, AGT 03 is now unblocked |
| AGT 01 | AGT 02, AGT 11, TRN 04 |
| AGT 02 | AGT 03, AGT 04 |
| AGT 04 | Removes the last baseline-policy blocker; AGT 08 now only waits on AGT 03 |
| AGT 05 | AGT 06, AGT 07, JDG 02 |
| AGT 06 | No new formal dependency edge by itself, but AGT 07 now has deterministic revision content to narrate and compare against |
| AGT 07 | AGT 10 is now unblocked, and the stub server now emits grounded Lab Manager responses instead of placeholder review text |
| AGT 10 | Prompt templates now exist for all three roles with bounded tool rules and normalized scenario rendering, reducing prompt drift between notebooks, demos, and future model calls |
| AGT 11 | No new formal dependency edge by itself, but the Scientist training model choice is now fixed across repo docs |
| ENV 01 | ENV 02, ENV 08, and the real-environment import path that partial server tasks now depend on |
| JDG 01 | Together with JDG 02 and JDG 03, unblocks JDG 04 (total reward formula) |
| JDG 02 | Together with JDG 01 and JDG 03, unblocks JDG 04 (total reward formula) |
| JDG 03 | Together with JDG 01 and JDG 02, unblocks JDG 04 (total reward formula) |
| JDG 04 | JDG 05, JDG 08, TST 04, TST 05 |
| JDG 05 | JDG 06, JDG 07, JDG 09, JDG 10, JDG 11, ENV 06 |
| JDG 06 | AGT 10, JDG 11 |
| ENV 02 | ENV 03, ENV 07, ENV 10, TST 01, API 02 (partial → full) |
| ENV 03 | ENV 04, ENV 05, TST 02, TST 03 |
| ENV 04 | ENV 05, TST 02 |
| ENV 05 | ENV 06, TST 02 |
| ENV 06 | ENV 07, ENV 09, ENV 11, API 03 (partial → full), API 06 (partial → full), OBS 07 |
| API 06 | TRN 03, TRN 13 |
| API 09 | API 10, API 17 |
| TST 07 | No new dependencies |
| ENV 07 | ENV 10 (partial unblock) |
| ENV 08 | API 07 (partial → full) |
| TST 01 | No new dependencies |
| TST 02 | No new dependencies |
| TST 03 | No new dependencies |
| API 02 | API 14, UI 06 |
| TRN 13 | TRN 03 now has both its dependencies met (API 06 + TRN 13) |
| TRN 03 | TRN 01 (Colab notebook skeleton), TRN 04 (reward shaping for GRPO) |
| TRN 04 | TRN 05 (trainer integration) and a partial unblock for TRN 06 (metrics logging once JDG 10 exists) |
| API 08 | API 09, API 16, API 19 |
| MOD 06 | Partial unblock for MOD 08 (unit tests for schemas and validators, which depends on MOD 01–07) |
| MOD 07 | MOD 08, JDG 07 |
| MOD 10 | Frontend and notebook consumers now share canonical schema examples generated from the current contracts |
| SCN 13 | No new formal dependency edge by itself, but deterministic booking and scheduling conflicts are now present in the normalized scenario pack for later environment, judge, and UI work |
| AGT 09 | No new formal dependency edge by itself, but the grounded Lab Manager checker/suggestion/response stack now has deterministic regression coverage |
| JDG 11 | ENV 11 (attach audit to terminal StepResult), UI 13 (render audit in frontend), OBS 09 (extend episode summary with audit) |
| ENV 11 | No new fully unblocked tasks by itself; API 18 and OBS 09 are each one dependency closer because the audit payload now survives into replay-facing state |
| API 10 | TRN 01 (Colab notebook skeleton), TRN 11 (environment URL documentation) |
| API 17 | No new formal dependency edge by itself, but the secrets landscape is now documented for all contexts |
| ENV 09 | OBS 01, API 05 |
| OBS 01 | OBS 03, OBS 07 |
| OBS 03 | No downstream dependencies beyond OBS 07, which is also complete |
| OBS 07 | No downstream dependencies |
| OBS 09 | TRN 15 is one dependency closer (it still needs TRN 06 and TRN 08) |
| API 05 | UI 08, OBS 05 |
| API 11 | No downstream dependencies |
| API 18 | TST 11, UI 13 |
| TST 06 | No downstream dependencies |
| TST 11 | No downstream dependencies |

## Current Unblocked and Active Tasks

All 152 tasks are complete. No tasks remain.


## Epic Progress

| Epic | Total Tasks | Completed | Rate |
|---|---|---|---|
| E01. Foundations and repository setup | 13 | 13 | 100.00% |
| E02. Domain models, validation, state contracts | 12 | 12 | 100.00% |
| E03. Scenario engine and constraint generation | 13 | 13 | 100.00% |
| E04. Scientist agent and Lab Manager policy | 11 | 11 | 100.00% |
| E05. Judge engine and reward logic | 11 | 11 | 100.00% |
| E06. OpenEnv environment implementation | 11 | 11 | 100.00% |
| E07. API, server, Docker, deployment | 19 | 19 | 100.00% |
| E08. RL training pipeline and evaluation | 15 | 15 | 100.00% |
| E09. Frontend, UX, replay, demo views | 15 | 15 | 100.00% |
| E10. Logging, replay, and observability | 9 | 9 | 100.00% |
| E11. Testing and quality gates | 12 | 12 | 100.00% |
| E12. README, demo video, submission packaging | 11 | 11 | 100.00% |