| FND 01 |
E01 |
Person C |
Create repo structure and base folders from agreed layout |
repo root |
2026-03-07 |
Created the full repo scaffold: replicalab/ with subdirectories for agents/, env/, prompts/, scenarios/, scoring/, utils/; server/; frontend/ with src/components/ and src/pages/; notebooks/; tests/. All directories tracked via .gitkeep files. |
All top-level folders exist and the repo clones cleanly
Yes |
| FND 02 |
E01 |
Person C |
Add Python project config and dependencies placeholder |
pyproject.toml |
2026-03-08 |
Added a PEP 621 pyproject.toml with package metadata, Python 3.10+ requirement, runtime dependencies (pydantic, fastapi, uvicorn, websockets), dev extras (pytest, pytest-cov, ruff, mypy), package discovery, and pytest test-path settings. |
Project installs locally without missing package errors for base modules |
Yes - verified with python -m pip install -e ., python -m pip install -e ".[dev]", and python -c "from replicalab.models import ..." |
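The pyproject described above presumably follows the standard PEP 621 shape; a minimal sketch for orientation (exact pins, metadata, and tool sections in the repo may differ):

```toml
[project]
name = "replicalab"
requires-python = ">=3.10"
dependencies = ["pydantic", "fastapi", "uvicorn", "websockets"]

[project.optional-dependencies]
dev = ["pytest", "pytest-cov", "ruff", "mypy"]

[tool.pytest.ini_options]
testpaths = ["tests"]
```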
| FND 04 |
E01 |
Person A |
Add empty Pydantic models and shared type names |
replicalab/models.py |
2026-03-08 |
Created replicalab/__init__.py and replicalab/models.py with the shared action, observation, step, state, and log stubs. |
Import paths resolve for all placeholder models |
Yes - verified with python -c "from replicalab.models import ..." |
| FND 05 |
E01 |
Person C |
Add ignore rules for Python, Node, logs, notebooks, and build artifacts |
.gitignore, .dockerignore |
2026-03-08 |
Added .dockerignore and expanded .gitignore for caches, coverage artifacts, notebook checkpoints, frontend build files, and generated outputs while preserving tracked .gitkeep files. |
Repo status stays clean after local run and build, and Docker build excludes non-runtime files |
Yes |
| FND 06 |
E01 |
Person D |
Add temporary project stub with title, mission, team roles, and local setup placeholder |
README.md |
2026-03-08 |
Replaced the aspirational README with a temporary foundation stub that reflects the current repo state, mission, ownership, and verified setup placeholder. |
New contributor can understand repo purpose in under two minutes |
Yes |
| FND 07 |
E01 |
Person C |
Define branch naming, PR template, and issue template |
.github/ and repo workflow docs |
2026-03-08 |
Added .github/pull_request_template.md and .github/ISSUE_TEMPLATE/task.yml, and documented preferred branch naming patterns plus required tracking-doc updates in docs/project_management_rules.md. |
All future PRs automatically show the template and issue fields
Yes |
| FND 09 |
E01 |
Person A |
Create OpenEnv configuration file specifying environment class, action and observation types, and server settings |
openenv.yaml, pyproject.toml, server/app.py, uv.lock |
2026-03-08 |
Added openenv.yaml, recorded the environment and contract metadata for OpenEnv, added openenv-core plus a server script entry point to pyproject.toml, added main() to server/app.py, and generated uv.lock so the repo passes local OpenEnv validation. |
OpenEnv can discover and serve the environment using this config file |
Yes - verified with uv lock and openenv validate |
| FND 10 |
E01 |
Person C |
Create output directory structure |
replicalab/outputs/ |
2026-03-07 |
Created replicalab/outputs/ with three subdirectories: logs/, replays/, and plots/, all tracked via .gitkeep files. |
Output directories exist and generated files are not committed to git |
Yes |
| MOD 01 |
E02 |
Person A |
Implement ScientistAction schema |
replicalab/models.py, tests/test_models.py, server/app.py |
2026-03-08 |
Replaced the ScientistAction stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, rejected mixed-mode payloads, added conditional validation for proposal, revision, request-info, and accept modes, and patched the stub server so accept preserves the current protocol. |
Valid scientist actions parse and invalid fields raise validation errors |
Yes - verified with python -m pytest tests/test_models.py and a stub-env ScientistAction.model_validate(...) smoke step |
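The pattern described here — strict enum-backed action types, forbidden unknown keys, and mode-conditional validation — can be sketched with Pydantic v2 roughly as follows (field names and the proposal shape are assumptions, not the project's actual frozen contract):

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, ConfigDict, model_validator


class ScientistActionType(str, Enum):
    PROPOSE = "propose"
    REVISE = "revise"
    REQUEST_INFO = "request_info"
    ACCEPT = "accept"


class ScientistAction(BaseModel):
    # Reject any key not declared on the model.
    model_config = ConfigDict(extra="forbid")

    action_type: ScientistActionType
    message: str
    proposal: Optional[dict] = None  # only meaningful for propose/revise

    @model_validator(mode="after")
    def _check_mode_consistency(self) -> "ScientistAction":
        needs_proposal = self.action_type in (
            ScientistActionType.PROPOSE,
            ScientistActionType.REVISE,
        )
        if needs_proposal and self.proposal is None:
            raise ValueError(f"{self.action_type.value} requires a proposal")
        if not needs_proposal and self.proposal is not None:
            raise ValueError("mixed-mode payload: proposal not allowed here")
        return self
```

With `extra="forbid"` plus the after-validator, both unknown keys and mixed-mode payloads fail at parse time rather than deep inside the env.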
| MOD 02 |
E02 |
Person A |
Implement LabManagerAction schema |
replicalab/models.py, tests/test_models.py |
2026-03-08 |
Replaced the LabManagerAction stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, enforced feasible-flag consistency, rejected suggestion fields outside suggest_alternative, and added focused validation tests. |
Valid lab manager actions parse and invalid fields raise validation errors |
Yes - verified with python -m pytest tests/test_models.py |
| MOD 03 |
E02 |
Person A |
Implement role specific observation models |
replicalab/models.py, tests/test_models.py, server/app.py |
2026-03-08 |
Added typed ConversationEntry and Protocol models, upgraded both observation branches to use typed nested structures with non-negative numeric constraints and stable keys, and verified dict-to-model coercion through the stub server. |
Scientist and lab observations serialize to JSON with stable keys |
Yes - verified with python -m pytest tests/test_models.py and a stub reset() / step() JSON smoke test |
| MOD 04 |
E02 |
Person A |
Implement EpisodeState and EpisodeLog models |
replicalab/models.py, server/app.py, tests/test_models.py |
2026-03-08 |
Replaced the remaining loose dict state and replay fields with typed Protocol, ConversationEntry, and RewardBreakdown models, updated the stub runtime to construct those nested models explicitly, and added round-trip coverage for serialized state and logs. |
Full state round trip serialize plus deserialize works |
Yes - verified with python -m pytest tests/test_models.py |
| MOD 05 |
E02 |
Person A |
Add protocol validation for sample size, controls, duration, equipment vocab, and reagent vocab |
replicalab/utils/validation.py, tests/test_models.py, tests/test_scenarios.py |
2026-03-08 |
Added deterministic semantic protocol validation with ValidationResult and validate_protocol(...) checks for resource vocabulary, allowed substitutions, duration limits, required-element coverage, and obvious impossibilities against the normalized scenario pack. |
Invalid protocol examples are rejected with readable reasons |
Yes - verified with python -m pytest tests/test_models.py tests/test_scenarios.py |
| MOD 06 |
E02 |
Person A |
Add semantic validators for impossible plans such as zero sample size with positive controls |
replicalab/utils/validation.py, tests/test_validation.py |
2026-03-08 |
Added _check_semantic_impossibilities() with five checks: zero sample with controls (error), controls >= sample size (error), duplicate controls (warning), duplicate equipment (warning), duplicate reagents (warning). Seven new tests cover all cases plus a regression guard confirming valid protocols still pass. |
Semantic validator catches at least five invalid edge cases |
Yes - verified with python -m pytest tests/test_validation.py (20 tests pass) and full suite (223 passed) |
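The five checks listed above (two errors, three duplicate warnings) can be illustrated with a standalone sketch; the function and field names here are assumptions, not the repo's exact API:

```python
from dataclasses import dataclass, field


@dataclass
class SemanticResult:
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


def check_semantic_impossibilities(sample_size, controls, equipment, reagents):
    result = SemanticResult()
    # Errors: plans that cannot be run at all.
    if sample_size == 0 and controls:
        result.errors.append("zero sample size with positive controls")
    if controls and len(controls) >= sample_size > 0:
        result.errors.append("controls >= sample size")
    # Warnings: sloppy but runnable plans.
    for name, items in (
        ("controls", controls),
        ("equipment", equipment),
        ("reagents", reagents),
    ):
        if len(items) != len(set(items)):
            result.warnings.append(f"duplicate {name}")
    return result
```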
| MOD 07 |
E02 |
Person C |
Add state serialization helper for replay logs |
replicalab/utils/logging.py, tests/test_logging.py |
2026-03-08 |
Added file-based replay persistence helpers with atomic JSON writes (write_episode_log, load_episode_log) plus CSV reward logging (append_reward_csv). Eleven tests cover lossless round-trip, filename behavior, nested directory creation, transcript and reward-breakdown preservation, CSV headers, append semantics, missing-file errors, and default output targets. |
State logs can be written and loaded without loss |
Yes - verified with python -m pytest tests/test_logging.py (11 tests pass) |
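An atomic JSON write of the kind described usually follows the write-to-temp-then-rename pattern; a minimal sketch assuming that approach (the repo's helpers may add filenames, CSV logging, and richer errors):

```python
import json
import os
import tempfile
from pathlib import Path


def write_episode_log(payload: dict, path: Path) -> None:
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # nested directory creation
    # Write to a temp file in the same directory, then atomically swap it in,
    # so readers never observe a half-written log.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp)
        raise


def load_episode_log(path: Path) -> dict:
    with open(path) as f:
        return json.load(f)
```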
| MOD 10 |
E02 |
Person C |
Publish schema examples for frontend and notebook clients |
tests/fixtures/generate_api_examples.py, tests/fixtures/api_schema_examples.json |
2026-03-08 |
Added a deterministic generator that builds canonical REST and WebSocket example payloads from real Pydantic models and seeded scenario data, then writes a shared api_schema_examples.json fixture for frontend and notebook consumers. The generated examples now use the current deterministic judge metadata instead of stale stub text. |
Frontend and notebook can mock against shared sample payloads |
Yes - verified with python tests/fixtures/generate_api_examples.py and fixture review |
| MOD 11 |
E02 |
Person A |
Implement StepResult model |
replicalab/models.py, server/app.py, tests/test_models.py |
2026-03-08 |
Added typed RewardBreakdown and StepInfo models, upgraded StepResult.info to the reserved-key contract while still allowing debug metadata, and updated the stub runtime to build typed reward and step-info payloads explicitly. |
Step result serializes cleanly and all consumers agree on its shape |
Yes - verified with python -m pytest tests/test_models.py |
| MOD 12 |
E02 |
Person A |
Create environment configuration module with shared constants |
replicalab/config.py, server/app.py, replicalab/scenarios/*.py, tests/test_config.py |
2026-03-08 |
Added a shared configuration module for default scenario and difficulty, round cap, budget cap, timeout values, stub reward, and API host or port defaults, then updated the server and scenario builders to import those constants instead of repeating literals. |
All modules import config from one place and no magic numbers remain in env or scoring code |
Yes - verified with python -m pytest tests/test_config.py tests/test_scenarios.py |
| SCN 01 |
E03 |
Person A |
Implement deterministic RNG helper seed_rng() |
replicalab/utils/seed.py, replicalab/scenarios/templates.py |
2026-03-08 |
Added deterministic seed helpers that derive reproducible RNG namespaces for scenario generation. |
Same seed always yields the same random choices and the seed utility is importable from scenarios and env |
Yes - verified with python -m pytest tests/test_scenarios.py |
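A plausible shape for a namespaced deterministic seed helper (the real `seed_rng()` signature may differ): hash the base seed together with a namespace string so each scenario component gets an independent but reproducible stream.

```python
import hashlib
import random


def seed_rng(seed: int, namespace: str = "") -> random.Random:
    # Derive a 64-bit sub-seed from (seed, namespace) so different
    # namespaces never share a stream, but reruns are identical.
    digest = hashlib.sha256(f"{seed}:{namespace}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))
```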
| SCN 02 |
E03 |
Person A |
Define normalized scenario schema with task summary, success criteria, constraints, resources, allowed substitutions, and hidden reference spec |
replicalab/scenarios/templates.py |
2026-03-08 |
Added NormalizedScenarioPack plus strict ScenarioConstraint, ScenarioResource, AllowedSubstitution, and HiddenReferenceSpec models to standardize all scenario families. |
All scenario builders return the same normalized top-level structure and mapper-ready inputs |
Yes - verified with python -m pytest tests/test_scenarios.py |
| SCN 03 |
E03 |
Person A |
Implement mathematics template |
replicalab/scenarios/math_reasoning.py |
2026-03-08 |
Added deterministic mathematics planning templates covering theorem, proof-goal, review, and time constraints. |
Generated scenario passes structure and internal consistency tests |
Yes - verified with python -m pytest tests/test_scenarios.py |
| SCN 04 |
E03 |
Person A |
Implement ML benchmark template |
replicalab/scenarios/ml_benchmark.py |
2026-03-08 |
Added deterministic ML benchmark templates covering dataset, compute, time, and evaluation constraints. |
Generated scenario passes structure and internal consistency tests |
Yes - verified with python -m pytest tests/test_scenarios.py |
| SCN 05 |
E03 |
Person A |
Implement finance and trading planning template |
replicalab/scenarios/finance_trading.py |
2026-03-08 |
Added deterministic offline finance and trading planning templates covering capital, drawdown, slippage, and backtest rules. |
Generated scenario passes structure and internal consistency tests |
Yes - verified with python -m pytest tests/test_scenarios.py |
| SCN 06 |
E03 |
Person A |
Implement difficulty application for easy, medium, hard |
replicalab/scenarios/templates.py, tests/test_scenarios.py |
2026-03-08 |
Added mechanical difficulty scaling that adjusts budgets, time, staff, resource availability, and injected conflict constraints across easy, medium, and hard. |
Difficulty visibly changes the normalized scenario pack in a meaningful way |
Yes - verified with python -m pytest tests/test_scenarios.py |
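Mechanical difficulty scaling of this kind can be sketched as a pure function over the pack; the multipliers and field names below are invented for illustration (the repo's actual values live in the templates module):

```python
def apply_difficulty(pack: dict, difficulty: str) -> dict:
    # Tighter budgets, shorter timelines, fewer staff as difficulty rises.
    scale = {"easy": 1.0, "medium": 0.6, "hard": 0.35}[difficulty]
    scaled = dict(pack)
    scaled["budget"] = int(pack["budget"] * scale)
    scaled["max_days"] = max(1, int(pack["max_days"] * scale))
    scaled["staff"] = max(1, int(pack["staff"] * scale))
    return scaled
```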
| SCN 07 |
E03 |
Person A |
Implement normalized constraint and resource generator |
replicalab/scenarios/templates.py, tests/test_scenarios.py |
2026-03-08 |
Added normalized constraint and resource mapping into role-specific observations with consistency checks for unique keys and non-contradictory generated packs. |
No generated scenario contains contradictory constraints or resources |
Yes - verified with python -m pytest tests/test_scenarios.py |
| SCN 08 |
E03 |
Person A |
Implement hidden reference spec and allowed substitutions per template |
replicalab/scenarios/templates.py, tests/test_scenarios.py |
2026-03-08 |
Added per-template hidden reference specs and allowed substitutions so scoring and negotiation can distinguish fixed versus flexible elements deterministically. |
Hidden reference clearly marks what is fixed versus flexible for deterministic scoring |
Yes - verified with python -m pytest tests/test_scenarios.py |
| SCN 09 |
E03 |
Person A |
Implement generate_scenario(seed, template, difficulty) |
replicalab/scenarios/templates.py, server/app.py, tests/test_scenarios.py |
2026-03-08 |
Added deterministic full-scenario generation and wired the stub server to use the normalized scenario families instead of the earlier hard-coded lab-only placeholder list. |
Function returns a full scenario with deterministic content |
Yes - verified with python -m pytest tests/test_scenarios.py and a _StubEnv.reset(...) smoke test |
| SCN 10 |
E03 |
Person A |
Add seeded generation tests and consistency tests |
tests/test_scenarios.py |
2026-03-08 |
Added seeded determinism, variation, difficulty, consistency, and family-list tests for the normalized scenario engine. |
Same seed plus template returns the same scenario and different seeds vary |
Yes - verified with python -m pytest tests/test_scenarios.py |
| SCN 13 |
E03 |
Person A |
Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time slot conflicts and duration |
replicalab/scenarios/templates.py, replicalab/scenarios/__init__.py, tests/test_scenarios.py |
2026-03-08 |
Added typed ResourceBooking and SchedulingWindow models, extended NormalizedScenarioPack with deterministic booking and scheduling data, wired seeded generation into the scenario builder across all three domains, and added five scenario tests covering determinism, easy-mode no-conflict behavior, JSON round-trip, valid windows, and domain coverage. |
Constraint generator can produce realistic booking conflicts across domains and the Lab Manager can check availability |
Yes - verified with python -m pytest tests/test_scenarios.py (13 tests pass) and full suite (304 passed) |
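Time-slot conflict detection for a booking model like this typically reduces to a half-open interval overlap test; a toy version (the `ResourceBooking` name comes from the entry above, but the fields here are assumptions):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceBooking:
    resource: str
    start_hour: int
    end_hour: int  # exclusive


def has_conflict(a: ResourceBooking, b: ResourceBooking) -> bool:
    # Conflict iff the bookings share a resource and their
    # half-open windows [start, end) overlap.
    return (
        a.resource == b.resource
        and a.start_hour < b.end_hour
        and b.start_hour < a.end_hour
    )
```

Using half-open windows means back-to-back bookings (one ends at hour 12, the next starts at 12) do not count as conflicts.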
| AGT 09 |
E04 |
Person A |
Add deterministic feasibility checker tests for Lab Manager grounding |
tests/test_lab_manager_policy.py |
2026-03-08 |
Added seventeen deterministic regression tests covering check_feasibility(...), suggest_alternative(...), and compose_lab_manager_response(...) across all three domains, including repeated-run determinism, substitution ordering, duration and sample-size revision stability, never-worsens checks, action-type branching, flag mirroring, and explanation stability. |
Same proposal plus same normalized scenario returns the same checker results every time |
Yes - verified with python -m pytest tests/test_lab_manager_policy.py (37 tests pass) and full suite (304 passed) |
| ENV 01 |
E06 |
Person A |
Create ReplicaLabEnv class skeleton |
replicalab/env/replicalab_env.py, replicalab/env/__init__.py |
2026-03-08 |
Added a real ReplicaLabEnv module as a drop-in replacement for the former in-server stub, ported the working stub behavior into the environment package, wired scenario-pack-backed reset/step/state/close methods with follow-on TODO(ENV XX) markers, and removed the old stub-only marker from StepInfo payloads.
Environment class imports and instantiates without runtime errors |
Yes - verified with a direct ReplicaLabEnv.reset(...) -> step(...) -> state() -> close() smoke run and python -m pytest (111 passed) |
| JDG 01 |
E05 |
Person A |
Implement rigor or objective-validity score |
replicalab/scoring/rigor.py, replicalab/utils/text.py, tests/test_reward.py |
2026-03-08 |
Added score_rigor(protocol, scenario) with weighted sub-scores for structural completeness (0.30), success criteria coverage (0.40), and required element coverage (0.30). Uses shared element_tokens from replicalab/utils/text.py. Five focused tests in test_reward.py cover quality ordering, determinism, controls impact, rationale length, and all-domain range validation. |
Score is between 0 and 1, matches rubric examples, and rewards correct evidence-backed planning |
Yes - verified with python -m pytest tests/test_reward.py (18 tests pass) |
| JDG 02 |
E05 |
Person A |
Implement feasibility score |
replicalab/scoring/feasibility.py, tests/test_reward.py |
2026-03-08 |
Added score_feasibility(protocol, scenario, check=None) that derives a continuous [0,1] signal from FeasibilityCheckResult (AGT 05). Seven dimensions weighted equally (1/7) with partial credit for budget, equipment, reagents, and staff. Accepts optional pre-computed check to avoid redundant work. Six focused tests cover viable protocol, infeasible ordering, pre-computed check equivalence, determinism, partial credit, and all-domain range. |
Score is between 0 and 1 and matches normalized constraint logic |
Yes - verified with python -m pytest tests/test_reward.py (18 tests pass) |
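The equal-weight, partial-credit aggregation described above amounts to averaging seven [0, 1] dimension scores; an illustrative version (the dimension names are assumptions):

```python
def score_feasibility(dimension_scores: dict) -> float:
    """Average seven [0, 1] dimension scores, each weighted 1/7."""
    dims = (
        "budget", "equipment", "reagents", "staff",
        "duration", "sample_size", "controls",
    )
    # Missing dimensions score 0.0; partial credit flows through directly.
    return sum(dimension_scores.get(d, 0.0) for d in dims) / len(dims)
```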
| JDG 03 |
E05 |
Person A |
Implement fidelity score |
replicalab/scoring/fidelity.py, tests/test_reward.py |
2026-03-08 |
Added score_fidelity(protocol, scenario) with substitution-aware scoring: required element coverage (0.50, direct match=1.0, substitution=0.7), flexible element alignment (0.20, bonus only), target metric alignment (0.20), and technique appropriateness (0.10). Five focused tests cover aligned vs misaligned ordering, determinism, substitution partial credit, target metric impact, and all-domain range. |
Score is between 0 and 1 and matches rubric examples for plan and evidence alignment |
Yes - verified with python -m pytest tests/test_reward.py (18 tests pass) |
| JDG 04 |
E05 |
Person A |
Implement total reward formula |
replicalab/scoring/rubric.py, tests/test_reward.py |
2026-03-07 |
compute_total_reward(breakdown) implements 10 × rigor × feasibility × fidelity + bonuses − penalties with a max(0.0, ...) floor clamp. Eight new tests in test_reward.py verify perfect-vs-broken ordering, zero-feasibility behavior, efficiency bonus ordering, exact penalty subtraction, zero-clamp floor, determinism, external penalties injection, and default-empty penalties. Seven existing rubric tests in test_env.py also cover the formula.
Total reward formula matches agreed math, clamps at zero, and returns consistent output for plan quality and bounded tool behavior |
Yes - verified with python -m pytest tests/test_reward.py (26 tests pass) and python -m pytest tests/test_env.py (36 tests pass) |
| JDG 05 |
E05 |
Person A |
Build reward breakdown object |
replicalab/scoring/rubric.py, replicalab/scoring/__init__.py, tests/test_reward.py |
2026-03-07 |
build_reward_breakdown(...) accepts an optional penalties: dict[str, float] parameter for named penalty keys (e.g. invalid_tool_use, unsupported_claim) from bounded-tool diagnostics without reopening the model contract. Returns a typed RewardBreakdown with rigor, feasibility, fidelity, efficiency_bonus, communication_bonus, and penalties dict. Exported through replicalab.scoring. |
Breakdown includes rigor, feasibility, fidelity, bonuses, penalties, and bounded tool diagnostics extension point |
Yes - verified with python -m pytest tests/test_reward.py (26 tests pass) and python -m pytest tests/test_env.py (36 tests pass) |
| JDG 06 |
E05 |
Person A |
Add optional plain English explanation function from reward breakdown |
replicalab/scoring/explain.py, replicalab/scoring/__init__.py, tests/test_reward.py |
2026-03-08 |
Added explain_reward(...), a deterministic explanation builder that mirrors rigor, feasibility, fidelity, bonuses, penalties, and total reward with stable quality-tier labels and without introducing any new scoring logic. Exported through replicalab.scoring and covered by nine focused tests. |
Explanation mirrors rubric, may reference bounded evidence or tool outcomes, and introduces no new hidden logic |
Yes - verified with python -m pytest tests/test_reward.py (40 tests pass) |
| JDG 08 |
E05 |
Person A |
Add score determinism tests and edge case tests |
tests/test_reward.py |
2026-03-08 |
Added six focused regression tests covering good-vs-awful ordering across all judge axes and total reward, success-criteria sensitivity in rigor scoring, partial equipment credit ordering in feasibility scoring, direct-match vs allowed-substitution vs miss ordering in fidelity scoring, and reward-breakdown determinism with and without a precomputed feasibility check. |
Perfect and broken protocols produce expected relative ordering and scoring remains deterministic across edge cases |
Yes - verified with python -m pytest tests/test_reward.py (40 tests pass) and python -m pytest -q (264 passed) |
| JDG 11 |
E05 |
Person A |
Add structured final audit payload with judge_notes, verdict, and top failure reasons |
replicalab/agents/judge_policy.py, replicalab/agents/__init__.py, tests/test_judge_policy.py |
2026-03-08 |
Created JudgeAudit model and build_judge_audit() builder that derives verdict (accept/timeout/no_agreement), reuses explain_reward() for judge_notes, and extracts top failure reasons from weak rubric components and penalty keys. Exported through replicalab.agents. Ten tests cover all three verdict paths, component-driven failure reasons, penalty surfacing, reason cap, good-protocol empty reasons, determinism, and JSON round-trip. |
Final judgement output is deterministic, human readable, and consumable by env, API, logs, and UI |
Yes - verified with python -m pytest tests/test_judge_policy.py (10 tests pass) and full suite (255 passed) |
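The three-way verdict derivation described above is small enough to state in full; a sketch of the decision rule (the real build_judge_audit() also assembles notes and failure reasons):

```python
def derive_verdict(accepted: bool, rounds_used: int, max_rounds: int) -> str:
    # Acceptance wins; otherwise distinguish running out of rounds
    # from ending without agreement.
    if accepted:
        return "accept"
    if rounds_used >= max_rounds:
        return "timeout"
    return "no_agreement"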
| ENV 02 |
E06 |
Person A |
Implement real reset wiring |
replicalab/env/replicalab_env.py |
2026-03-08 |
_make_observation() now uses the scenario pack as source of truth for booked/out-of-stock/safety data instead of empty placeholders. Eight reset tests verify both roles populated, booked/out-of-stock preserved, all templates and difficulties. |
Reset returns initial observations with full scenario data |
Yes - verified with python -m pytest tests/test_env.py (32 tests pass) |
| ENV 03 |
E06 |
Person A |
Implement Scientist turn with validation |
replicalab/env/replicalab_env.py |
2026-03-08 |
Added _validate_scientist_action() that runs validate_protocol() on proposals and returns structured error strings without crashing the env. Invalid actions don't advance the round. |
Valid action updates state, invalid action returns structured error |
Yes - verified with python -m pytest tests/test_env.py (32 tests pass) |
| ENV 04 |
E06 |
Person A |
Implement Lab Manager response step |
replicalab/env/replicalab_env.py |
2026-03-08 |
_lab_manager_action() uses the full grounded pipeline: check_feasibility() → suggest_alternative() → compose_lab_manager_response().
Lab Manager response is grounded in feasibility check results |
Yes - verified with python -m pytest tests/test_env.py (32 tests pass) |
| ENV 05 |
E06 |
Person A |
Centralize termination logic |
replicalab/env/replicalab_env.py |
2026-03-08 |
Added _check_termination(): the episode terminates on a Scientist accept with an existing protocol or on reaching max_rounds; a Lab Manager accept does NOT auto-terminate.
Episode terminates on agreement or round limit |
Yes - verified with python -m pytest tests/test_env.py (32 tests pass) |
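The centralized termination rule can be sketched as a pure predicate (parameter names are assumptions; the real check reads them off EpisodeState):

```python
def check_termination(
    actor: str,
    action_type: str,
    has_protocol: bool,
    round_num: int,
    max_rounds: int,
) -> bool:
    # Only a Scientist accept with a protocol on the table ends the episode
    # early; a Lab Manager accept does not.
    if actor == "scientist" and action_type == "accept" and has_protocol:
        return True
    # Otherwise the round cap is the only terminator.
    return round_num >= max_rounds
```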
| ENV 06 |
E06 |
Person A |
Wire real judge scoring |
replicalab/env/replicalab_env.py, tests/test_env.py |
2026-03-07 |
Terminal accept steps call build_reward_breakdown() and compute_total_reward() with real rigor/feasibility/fidelity scores stored in EpisodeState. Terminal-without-agreement path now distinguishes timeout (max rounds) from no_agreement verdict. Four new tests in TestEnvReward verify agreement-terminal breakdown/notes/verdict, no-agreement determinism, timeout verdict, and state-stored component scores. |
Final step returns total reward, breakdown info, and deterministic penalties or bonuses; verdict distinguishes timeout from no_agreement |
Yes - verified with python -m pytest tests/test_env.py (36 tests pass) and python -m pytest (178 tests pass) |
| ENV 07 |
E06 |
Person A |
Implement state() deep snapshot |
replicalab/env/replicalab_env.py |
2026-03-08 |
state() now returns self._state.model_copy(deep=True) so callers get an independent snapshot. Two tests verify mutation isolation. |
State snapshot is independent of env internals |
Yes - verified with python -m pytest tests/test_env.py (32 tests pass) |
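Why the deep copy matters: without it, a caller mutating the returned snapshot would silently mutate env internals. A toy demonstration with a minimal Pydantic model (the real EpisodeState has many more fields):

```python
from pydantic import BaseModel


class ToyState(BaseModel):
    round: int = 0
    history: list[str] = []


class ToyEnv:
    def __init__(self) -> None:
        self._state = ToyState()

    def state(self) -> ToyState:
        # Deep copy: nested lists/models are copied too, so the caller
        # cannot reach back into env internals through the snapshot.
        return self._state.model_copy(deep=True)
```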
| ENV 08 |
E06 |
Person A |
Implement close() with lifecycle guard |
replicalab/env/replicalab_env.py |
2026-03-08 |
Added _closed flag, idempotent close(), _ensure_open() guard on step(), and reset() reopens a closed env. Three tests verify idempotency, step-after-close raises, and reset-reopens. |
Close frees resources and does not throw; step after close raises |
Yes - verified with python -m pytest tests/test_env.py (32 tests pass) |
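The lifecycle guard described here is a common pattern; a sketch with the same three behaviors (idempotent close, step-after-close raises, reset reopens), with illustrative internals:

```python
class LifecycleEnv:
    def __init__(self) -> None:
        self._closed = False

    def _ensure_open(self) -> None:
        if self._closed:
            raise RuntimeError("env is closed; call reset() to reopen")

    def step(self, action):
        self._ensure_open()
        return {"action": action}

    def reset(self):
        self._closed = False  # reset reopens a closed env
        return {"observation": "initial"}

    def close(self) -> None:
        self._closed = True  # idempotent: calling twice is harmless
```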
| ENV 10 |
E06 |
Person A |
Add reset, step, invalid action, timeout, and deterministic replay tests |
tests/test_env.py |
2026-03-08 |
Added a dedicated replay-determinism regression block that verifies same seed plus same actions yields the same initial observation, step trajectory, timeout terminal path, invalid-action behavior, and audit payload across math, ML, and finance families. The new coverage keeps replay deterministic without depending on file-backed logging. |
Tests pass for seeded reset, valid step, invalid step, timeout, and replay consistency across supported scenario families |
Yes - verified with python -m pytest tests/test_env.py (56 tests pass) and full suite (327 passed) |
| ENV 11 |
E06 |
Person A |
Attach judge audit payload to final StepResult, terminal observations, and replay state |
replicalab/models.py, replicalab/env/replicalab_env.py, server/app.py, tests/test_env.py, tests/test_server.py |
2026-03-08 |
Added top_failure_reasons to StepInfo, EpisodeState, and EpisodeLog; terminal env steps now build a canonical audit via build_judge_audit(...); and replay log construction now persists top_failure_reasons from terminal StepResult.info instead of dropping them. Seven env tests cover terminal audit behavior and a replay test verifies the audit reasons survive into GET /replay/{episode_id} payloads. |
Completed episodes expose audit notes alongside reward breakdown in a stable schema across env state and replay |
Yes - verified with python -m pytest tests/test_env.py (43 tests pass), python -m pytest tests/test_server.py (37 tests pass), and full suite (314 passed) |
| OBS 04 |
E10 |
Person A |
Add deterministic replay test using seed and action sequence |
tests/test_env.py |
2026-03-08 |
Closed the observability-side replay guard by reusing the new seeded replay-determinism suite in TestReplayDeterminism, which verifies same-seed same-action trajectories, timeout replay determinism, invalid-action replay determinism, and stable terminal audit payloads across all three scenario families. |
Replay of the same seed and action sequence matches the prior state sequence deterministically |
Yes - verified with python -m pytest tests/test_env.py (56 tests pass) and full suite (327 passed) |
| TST 01 |
E11 |
Person A |
Add reset returns valid observations test |
tests/test_env.py |
2026-03-08 |
Eight tests in TestReset class covering both roles populated, scientist fields, lab manager fields, booked/out-of-stock preservation, state round zero, episode ID, clearing previous episode, and all templates/difficulties. |
Test confirms both roles receive valid structured observations |
Yes - verified with python -m pytest tests/test_env.py (32 tests pass) |
| TST 02 |
E11 |
Person A |
Add valid action step test |
tests/test_env.py |
2026-03-08 |
Eight tests in TestStep class covering round advancement, observation shape, conversation history, accept termination, real reward scores, max round termination, step info fields, and full propose-then-accept episode. |
Valid action advances round and returns correct shape |
Yes - verified with python -m pytest tests/test_env.py (32 tests pass) |
| TST 03 |
E11 |
Person A |
Add invalid action handling test |
tests/test_env.py |
2026-03-08 |
Four tests in TestInvalidAction class covering error string on invalid duration, env survival after error, no round advancement on invalid action, and request_info always passes. |
Invalid action yields structured error and env survives |
Yes - verified with python -m pytest tests/test_env.py (32 tests pass) |
| TST 04 |
E11 |
Person A |
Add perfect protocol high reward test |
tests/test_reward.py |
2026-03-08 |
Added reward-regression coverage proving a fully aligned protocol scores higher than a broken baseline and stays ordered consistently across reruns. |
Perfect protocol scores higher than baseline and broken protocol |
Yes - verified with python -m pytest tests/test_reward.py (26 tests pass) |
| TST 05 |
E11 |
Person A |
Add zero dimension or penalty behavior test |
tests/test_reward.py |
2026-03-08 |
Added reward-regression coverage for zero-feasibility collapse, exact penalty subtraction, and zero-floor clamp behavior so timeout and penalty paths lower reward deterministically. |
Zero feasibility or timeout lowers reward as expected |
Yes - verified with python -m pytest tests/test_reward.py (26 tests pass) |
| MOD 08 |
E02 |
Person A |
Write unit tests for schemas and validators |
tests/test_mod08_schemas.py |
2026-03-08 |
Created 70 comprehensive unit tests covering all Pydantic model edge cases: ScientistAction (15 tests for all action types, mixed-mode rejection, whitespace stripping, empty/negative field rejection), LabManagerAction (11 tests for all action types, feasible-flag consistency, suggestion-field rules), Protocol (10 tests for boundary values, stripping, extra-field rejection), ConversationEntry (7 tests for null/empty action_type, role validation), RewardBreakdown (9 tests for boundary values, range rejection), Observation (4 tests for both-none, single-role), LabManagerObservation (3 tests for negative fields, stripping), StepInfo (3 tests for extra-field allowance), StepResult (3 tests), EpisodeState (2 tests), EpisodeLog (3 tests for failure reasons, model_dump keys). |
Tests cover valid parse, invalid parse, and replay serialization |
Yes - verified with python -m pytest tests/test_mod08_schemas.py -v (70 passed) and full suite (409 passed) |
| API 03 |
E07 |
Person C |
Add POST /step endpoint |
server/app.py, tests/test_server.py |
2026-03-07 |
Fixed _build_episode_log() to take the real StepResult instead of rebuilding reward data from state with stale stub values. Both REST /step and WebSocket step handler now pass the terminal StepResult to the updated helper so replay logs use real reward_breakdown, judge_notes, and verdict (including timeout vs no_agreement). Added five endpoint tests covering reset-then-step happy path, invalid session ID 404, terminal step with real reward breakdown, semantic invalid action returning 200 with info.error, and replay with real judge data. |
Step endpoint accepts valid action and returns step result |
Yes - verified with python -m pytest tests/test_server.py (10 tests pass) and python -m pytest (183 tests pass) |
| API 06 |
E07 |
Person C |
Add WebSocket session handler with isolated env per connection |
server/app.py, tests/test_server.py |
2026-03-07 |
WebSocket handler at /ws supports reset, step, and ping message types with per-connection env isolation, idle timeout, and replay storage on terminal episodes. Twelve WebSocket tests cover ping-pong, reset observation, step result, full episode real reward, invalid JSON, missing action field, invalid action payload, unknown message type, session isolation, semantic invalid action returning step_ok with info.error, timeout verdict proving real-env integration, and terminal episode replay persistence via GET /replay/{episode_id}. |
WebSocket session handler supports reset, step, ping with isolated env per connection and correct replay storage |
Yes - verified with python -m pytest tests/test_server.py (22 tests pass) and python -m pytest (195 tests pass) |
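The semantic-vs-transport error split described above can be sketched as a framework-free dispatch function: transport problems (invalid JSON, missing fields, unknown message types) produce error frames, while a structurally valid step always produces `step_ok`. Message and reply shapes are illustrative, not the project's exact protocol:

```python
import json

# Minimal sketch of the per-connection message dispatch described above.
def handle_message(raw: str, session: dict) -> dict:
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return {"type": "error", "error": "invalid JSON"}       # transport error
    kind = msg.get("type")
    if kind == "ping":
        return {"type": "pong"}
    if kind == "reset":
        session["env"] = session["make_env"]()                  # fresh env per connection
        return {"type": "reset_ok", "observation": session["env"].reset()}
    if kind == "step":
        if "action" not in msg:
            return {"type": "error", "error": "missing action"} # transport error
        result = session["env"].step(msg["action"])
        # Semantically invalid actions come back as step_ok with info.error,
        # matching the env contract; only transport problems use error frames.
        return {"type": "step_ok", "reward": result["reward"],
                "done": result["done"], "info": result["info"]}
    return {"type": "error", "error": f"unknown message type: {kind}"}
```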
| TST 07 |
E11 |
Person C |
Add WebSocket session handler tests |
tests/test_server.py |
2026-03-07 |
Twelve focused WebSocket tests covering connectivity, message handling, error paths, session isolation, semantic-vs-transport error distinction, timeout verdict, and replay log persistence with real judge data. Tests verify that structurally valid but semantically invalid actions return step_ok with info.error (not WS error frames), matching the env contract. |
WebSocket tests cover happy path, error handling, session isolation, and real-env integration |
Yes - verified with python -m pytest tests/test_server.py (22 tests pass) |
| API 02 |
E07 |
Person C |
Add POST /reset endpoint |
server/app.py, tests/test_server.py |
2026-03-08 |
/reset endpoint creates a new env (or closes the prior one when reusing session_id), calls env.reset(...), persists env, last_active, and episode_id in the in-memory REST session store, and returns session_id, episode_id, observation. Seven dedicated tests cover response shape, both-role observation, explicit session_id reuse, prior-env close on reuse, default params, all scenario/difficulty combos, and seed determinism. |
Reset endpoint starts a new episode and returns initial observation |
Yes - verified with python -m pytest tests/test_server.py (29 tests pass) and python -m pytest (202 tests pass) |
| API 04 |
E07 |
Person C |
Add GET /scenarios endpoint |
server/app.py, tests/test_server.py |
2026-03-08 |
GET /scenarios returns the available_scenario_families() output through the typed ScenariosResponse model. Five focused tests cover status code, response shape, all three scenario families, the expected easy, medium, and hard difficulties, and the absence of extra keys. |
Endpoint lists available scenario families and difficulties |
Yes - verified with python -m pytest tests/test_server.py -v (34 tests pass) |
| API 07 |
E07 |
Person C |
Add idle timeout and graceful disconnect cleanup |
server/app.py, tests/test_server.py |
2026-03-08 |
Verified the existing WebSocket idle-timeout and disconnect cleanup path with two focused tests: one monkeypatches the idle timeout to 0.5s and confirms the server closes with code 1000 when no message arrives, and one wraps _make_env() to confirm env.close() is called exactly once from the finally block on disconnect. |
Stale connections close cleanly and the environment closes without leak |
Yes - verified with python -m pytest tests/test_server.py -v (34 tests pass) |
| API 13 |
E07 |
Person C |
Add CORS middleware configuration for frontend origins in dev and production |
server/app.py, tests/test_server.py |
2026-03-08 |
Confirmed the existing FastAPI CORS middleware allows the local Vite frontend origin plus https://*.hf.space, and added three explicit preflight tests covering localhost allowance, HF Space allowance, and disallowed-origin rejection. |
Frontend on localhost:5173 and HF Space origin can reach the API without CORS errors |
Yes - verified with python -m pytest tests/test_server.py -v (34 tests pass) |
| API 08 |
E07 |
Person C |
Build Dockerfile with Python app startup on port 7860 |
server/Dockerfile, Dockerfile, server/requirements.txt, docs/max/deployment.md |
2026-03-08 |
Fixed editable install (-e . → . --no-deps) in both server/Dockerfile and root Dockerfile, added httpx and websocket-client to server/requirements.txt (required by replicalab.client), rebuilt without cache. Verified the Docker container starts with the real env ("env":"real"), and all four endpoints work: GET /health, GET /scenarios, POST /reset, POST /step. Added verified endpoint commands to docs/max/deployment.md.
Local Docker run serves app on port 7860 |
Yes - verified with docker build -f server/Dockerfile -t replicalab . && docker run -p 7860:7860 replicalab and curl against all four endpoints |

| API 09 |
E07 |
Person C |
Add Hugging Face Space metadata and deploy instructions |
README.md, Dockerfile, docs/max/deployment.md |
2026-03-08 |
Added the Hugging Face Spaces YAML frontmatter to the root README, created the root-level Dockerfile required by the Docker SDK, and documented Space creation, git remote setup, push, logs, and secret management in docs/max/deployment.md. |
Space config is valid for Docker app deployment |
Yes - verified against HF Spaces Docker deployment requirements |
| API 15 |
E07 |
Person C |
Create HF Space README.md with YAML frontmatter |
README.md |
2026-03-08 |
Added the required Spaces frontmatter fields (sdk: docker, app_port: 7860, title, emoji, colors, pinned) to the root README so Hugging Face parses the Space metadata correctly on push. |
HF Space config is valid and Space launches correctly from the metadata |
Yes - verified against the HF Spaces frontmatter schema |
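A minimal sketch of the frontmatter described above; `sdk: docker` and `app_port: 7860` come from the entry itself, while title, emoji, colors, and pinned values are placeholders:

```yaml
---
title: ReplicaLab
emoji: 🧪
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---
```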
| API 14 |
E07 |
Person C |
Add REST session management so each user gets isolated environment state |
tests/test_api_rest_isolation.py |
2026-03-08 |
Created 11 dedicated REST session isolation tests in a standalone file covering: two resets produce different sessions, independent observations across scenarios, stepping one session does not mutate the other, independent round counts, terminal isolation, session_id reuse creates new episode and resets rounds, reuse does not affect other sessions, 404 on nonexistent session, step-after-terminal behavior, and replay isolation between sessions. No server changes needed; isolation already works correctly.
Two concurrent REST users do not share or corrupt each other's episode state |
Yes - verified with python -m pytest tests/test_api_rest_isolation.py (11 tests pass) and full suite (307 passed) |
| API 10 |
E07 |
Person C |
Deploy live Space and verify health, reset, and step |
docs/max/deployment.md, README.md |
2026-03-08 |
Verified the live HF Space at https://ayushozha-replicalab.hf.space with all four endpoints: GET /health (200, env=real), GET /scenarios (200, 3 families), POST /reset (200, returns session_id/episode_id/observation), POST /step (200, returns reward/done/info). Ran a full episode (propose → accept) with real judge scoring: rigor=0.465, feasibility=1.000, fidelity=0.325, total_reward=2.313, verdict=accept. Updated deployment docs and README with verified live URL.
Live Space responds successfully and one end-to-end episode works on the hosted env |
Yes - verified with httpx requests against https://ayushozha-replicalab.hf.space |
| API 17 |
E07 |
Person C |
Document secrets and API key management for hosted deployment and Colab |
docs/max/deployment.md |
2026-03-08 |
Documented that the server is fully self-contained with no external API calls or secrets required. Added secrets reference table for all four contexts (HF Space, local dev, Docker, Colab notebook) with HF_TOKEN for model downloads and REPLICALAB_URL for hosted env. Documented Colab Secrets panel setup. Added future secrets section for an optional hosted evaluator. |
Secrets setup is documented clearly enough for another teammate to reproduce |
Yes - verified by inspecting server/app.py for env var references (none found) and documenting the complete secrets landscape |
| JDG 07 |
E05 |
Person C |
Log reward breakdown to CSV or JSONL per episode |
replicalab/utils/logging.py, tests/test_logging.py |
2026-03-08 |
Verified existing implementation: append_reward_csv() writes per-episode rows with all V2 columns (parsimony, bonuses, penalty total, verdict), append_reward_jsonl() preserves nested penalty dicts and bounded-tool metrics, and log_episode_reward() writes to both formats. 22 tests in tests/test_logging.py cover CSV creation, header dedup, JSONL records, default breakdowns, nested penalty preservation, determinism, and the convenience wrapper. No code changes needed. |
Reward file contains seed, scenario, score components, total reward, rounds, agreement, and bounded tool metrics |
Yes - verified with python -m pytest tests/test_logging.py -v (22 passed) and full suite (409 passed) |
| API 01 |
E07 |
Person C |
Create FastAPI app shell and health endpoint |
server/app.py |
2026-03-08 |
Verified the FastAPI app shell is fully functional: GET /health returns 200 with {"status":"ok","env":"real"}, the app imports and wires ReplicaLabEnv, logging is configured via env vars, CORS middleware is active, and all downstream endpoints (reset, step, scenarios, replay, WebSocket) are operational. Server endpoint tests in tests/test_server.py (34 tests) and REST isolation tests (11 tests) confirm full coverage. No code changes needed; the task was already complete beyond its acceptance criteria.
GET /health returns 200 with simple payload |
Yes - verified with existing tests and full suite (409 passed) |
| OBS 02 |
E10 |
Person C |
Add local log levels and readable console formatting |
replicalab/config.py, server/app.py |
2026-03-08 |
Verified logging already meets all acceptance criteria: REPLICALAB_LOG_LEVEL env var toggles log verbosity without code edits (default INFO), LOG_FORMAT provides readable %(asctime)s [%(levelname)s] %(name)s: %(message)s layout, and server/app.py wires both via logging.basicConfig(). No code changes needed. |
Debug logs can be toggled without code edits |
Yes - verified by reading replicalab/config.py (lines 30-31) and server/app.py (lines 75-79) |
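The env-var wiring described above can be sketched in a few lines; the real constants live in `replicalab/config.py` and are applied in `server/app.py`, so this is only a shape, not the project's code:

```python
import logging
import os

# Toggle verbosity without code edits: REPLICALAB_LOG_LEVEL defaults to INFO.
LOG_LEVEL = os.environ.get("REPLICALAB_LOG_LEVEL", "INFO")
LOG_FORMAT = "%(asctime)s [%(levelname)s] %(name)s: %(message)s"

logging.basicConfig(level=LOG_LEVEL, format=LOG_FORMAT)
logger = logging.getLogger("replicalab")
logger.debug("hidden at the default INFO level")
logger.info("visible at the default INFO level")
```

Running with `REPLICALAB_LOG_LEVEL=DEBUG` would then surface the debug line with no code change.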
| ENV 09 |
E06 |
Person C |
Write episode logs on completion |
server/app.py |
2026-03-08 |
Added write_episode_log() and log_episode_reward() calls to server/app.py in both REST /step and WebSocket step handlers. Terminal episodes now auto-persist replay JSON and reward CSV/JSONL to disk. |
Completed episodes generate replayable logs automatically |
Yes - verified with terminal episode persistence through REST and WebSocket paths |
| OBS 09 |
E10 |
Person C |
Extend episode summary with audit metadata |
replicalab/models.py |
2026-03-08 |
Added invalid_action_count (int) and invalid_action_rate (float) fields to EpisodeLog in replicalab/models.py. Server tracks invalid actions per session and per WebSocket connection. |
Every completed episode log contains the audit payload plus demo and evaluation metrics |
Yes - verified with model field presence and server-side tracking |
| OBS 07 |
E10 |
Person C |
Script to run one episode and dump logs |
scripts/run_episode.py |
2026-03-08 |
Created scripts/run_episode.py that resets the env, runs a baseline propose-then-accept episode, and writes replay JSON plus reward CSV/JSONL. |
One command produces a complete local sample log |
Yes - verified with script execution producing replay and reward files |
| TST 11 |
E11 |
Person C |
Judge audit payload contract tests |
tests/test_audit_contract.py |
2026-03-08 |
Created tests/test_audit_contract.py with 17 tests across 3 classes: StepInfoAuditContract (6 tests), EpisodeLogAuditContract (6 tests), AuditModelContracts (5 tests). |
Tests confirm terminal payloads and replay files expose audit notes, agreement, and invalid action metrics |
Yes - verified with python -m pytest tests/test_audit_contract.py (17 tests pass) |
| API 05 |
E07 |
Person C |
Add GET /replay/{episode_id} endpoint |
server/app.py |
2026-03-08 |
Already implemented at server/app.py, lines 536-540. The endpoint returns the completed episode log JSON for a valid episode ID.
Endpoint returns completed log for valid episode id |
Yes - verified with existing replay endpoint tests |
| API 11 |
E07 |
Person C |
Add server endpoint tests and WebSocket smoke test |
tests/test_server.py |
2026-03-08 |
Already implemented in tests/test_server.py with 44 tests covering health, reset, step, scenarios, replay, WebSocket connectivity, error handling, session isolation, and smoke paths. |
Local server tests pass for health, reset, step, invalid payload, and ws connect |
Yes - verified with python -m pytest tests/test_server.py (44 tests pass) |
| API 18 |
E07 |
Person C |
Include judge audit payload in terminal responses |
server/app.py |
2026-03-08 |
Already implemented. Terminal StepInfo includes judge_notes, verdict, and top_failure_reasons from the real judge audit in both REST and WebSocket paths. |
Clients receive judge_notes, verdict fields, and bounded tool audit data without separate log file access |
Yes - verified with terminal response inspection and audit contract tests |
| OBS 01 |
E10 |
Person C |
Standardize episode log schema |
replicalab/models.py |
2026-03-08 |
Already implemented. EpisodeLog model in replicalab/models.py is the canonical schema with all required fields for transcript, state snapshots, scores, and audit metadata. |
Every completed episode log contains the same required fields |
Yes - verified with EpisodeLog model inspection and schema tests |
| OBS 03 |
E10 |
Person C |
Episode id generation and file naming conventions |
replicalab/utils/logging.py |
2026-03-08 |
Already implemented: episode UUIDs are generated in the env, and replay files are named {episode_id}.json in replicalab/utils/logging.py. Logs never overwrite because each episode gets a unique UUID.
Logs never overwrite and are easy to locate |
Yes - verified with replay file naming behavior |
| TST 06 |
E11 |
Person C |
Health, reset, and step endpoint tests
tests/test_server.py |
2026-03-08 |
Already implemented in tests/test_server.py with TestHealthEndpoint, TestResetEndpoint, and TestStepEndpoint classes. |
API tests pass locally |
Yes - verified with python -m pytest tests/test_server.py |