# Change Log
This file records deviations from the original project plan.
Rules:
- Append new entries; do not rewrite history unless a prior entry is factually wrong.
- Record the contributor, the task or area, the deviation, and the reason.
- Update this file in the same branch or PR as the deviation whenever possible.
| Date | Contributor | Task or Area | Deviation | Reason | Impact | Follow-up |
| --- | --- | --- | --- | --- | --- | --- |
| 2026-03-07 | Person B (Ayush) | FND 01 | Executed the task even though it was assigned to Person C | The repo scaffold was missing and needed immediately to unblock foundation work | Repo structure was created and tracking docs were updated to reflect the actual executor | None |
| 2026-03-08 | Person B (Ayush) | FND 02 | Executed the task even though it was assigned to Person C | The Python package config was needed to verify editable installs and unblock `FND 11` | `pyproject.toml` was added, install verification was run, and tracking docs were updated | `FND 11` is now unblocked |
| 2026-03-07 | Person B (Ayush) | FND 10 | Executed the task even though it was assigned to Person C | The output directories were still missing after the initial scaffold and needed for backlog compliance | `replicalab/outputs/` and subdirectories were added and tracking docs were updated | None |
| 2026-03-08 | Person B (Ayush) | FND 04 | Executed the task even though it was assigned to Person A | The shared contract stubs were needed to unblock `FND 08` and downstream schema work | `replicalab/models.py` was created and tracking docs were updated | None |
| 2026-03-08 | Person B (Ayush) | FND 05 | Executed the task even though it was assigned to Person C | Ignore rules were incomplete and needed to keep generated artifacts out of git and Docker context | `.gitignore` and `.dockerignore` were updated and tracking docs were aligned | None |
| 2026-03-08 | Person B (Ayush) | FND 06 | Executed the task even though it was assigned to Person D | The existing README described a future state and needed to become an honest temporary stub for new contributors | `README.md` now reflects the current foundation stage and verified setup placeholder | `DOC 01` is now unblocked |
| 2026-03-08 | Person B (Ayush) | FND 07 | Executed the task even though it was assigned to Person C | GitHub templates and explicit repo workflow artifacts were needed to reduce coordination overhead | PR and task templates were added and the project-management rules were tightened | Future PRs and task issues should use the new templates |
| 2026-03-08 | Person B (Ayush) | Project management | Added governance docs and a deviation log outside the original backlog | Coordination overhead and tracking drift had become a project-management risk | New repo rules now govern future task tracking, docs updates, and deviation logging | Keep these docs in sync with future work |
| 2026-03-08 | Person B (Ayush) | Project management | Replaced placeholder owner-doc folders with real-name folders for Kian, Max, and Kush | The team standardized on real names for owner-facing docs before future merges | Owner docs now live under `docs/kian/`, `docs/max/`, and `docs/kush/`, and the governance docs record the mapping | Use real-name folders for future owner-doc updates |
| 2026-03-08 | Person B (Ayush) | PR 7 import for Max | Normalized a stale contributor PR before merge instead of merging it directly | The incoming branch would have deleted governance docs, reverted current task tracking, and overstated backend task completion | Only the validated backend subset was imported, `FND 11` was marked complete, and the stub-backed API work was recorded as partial | Real-env wiring, Docker validation, and deployment verification still remain |
| 2026-03-08 | Person B (Ayush) | FND 08 and FND 09 | Recorded Kian-side sign-off for the shared contract and executed `FND 09` even though it was assigned to Person A | The same contributor is currently covering both the Kian and Ayush lanes, and the OpenEnv registration layer needed to be real rather than left as a placeholder | `FND 08` is now complete, `openenv.yaml` exists, and the repo now carries the minimal OpenEnv runtime wiring needed for local validation | The real environment class in `replicalab/env/replicalab_env.py` is still a later task |
| 2026-03-08 | Person B (Ayush) | MOD 01 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and the strict `ScientistAction` validator was the highest-leverage unblocker for downstream parser and validation work | `ScientistAction` now enforces the frozen contract, `MOD 09` and `MOD 05` are unblocked, and focused schema tests now exist in `tests/test_models.py` | `MOD 03` is the next schema-critical Kian task |
| 2026-03-08 | Person B (Ayush) | MOD 02 and MOD 03 | Executed the tasks even though they were assigned to Person A | The Kian and Ayush lanes are being covered together, and the strict Lab Manager plus typed observation contracts were the fastest way to stabilize the shared schema surface before parser, state, and environment work fan out | `LabManagerAction`, `ConversationEntry`, `Protocol`, and both observation branches now enforce the frozen contract, `MOD 04` and `MOD 11` are unblocked, and the stub server path is verified against the typed models | `MOD 12`, `SCN 01`, and `MOD 05` are the next Kian-lane tasks |
| 2026-03-08 | Person B (Ayush) | MOD 12 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and centralizing shared defaults was the cleanest way to stop config drift before the real environment and scoring modules expand | `replicalab/config.py` now holds shared defaults for scenario selection, difficulty, round cap, budget cap, timeout values, stub reward, and API host or port defaults, and the server plus scenario builders import them instead of repeating literals | `MOD 05`, `MOD 04`, and `MOD 11` remain the next Kian-lane foundation tasks |
| 2026-03-08 | Person B (Ayush) | MOD 11 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and a typed step-result contract was needed before the environment, API, replay, and training paths grew around loose metadata | `RewardBreakdown`, `StepInfo`, and typed `StepResult.info` now exist, and the stub runtime explicitly constructs those reserved-key payloads while preserving debug metadata | `MOD 04` and `MOD 05` were the remaining Kian-lane foundation tasks after this |
| 2026-03-08 | Person B (Ayush) | MOD 04 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and state plus replay needed to use the same typed protocol and conversation models already enforced at the action and observation layers | `EpisodeState` and `EpisodeLog` now carry typed `Protocol`, `ConversationEntry`, and `RewardBreakdown` fields, the stub runtime constructs those nested models explicitly, and replay serialization is now aligned with the typed contract | `MOD 07` and `ENV 01` are now unblocked |
| 2026-03-08 | Person B (Ayush) | MOD 05 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and structural schema validation was not enough to stop impossible or hallucinated plans from reaching the environment | `replicalab/utils/validation.py` now provides deterministic protocol validation against normalized scenario resources, substitutions, time limits, and required elements, returning structured issues instead of relying on ad hoc runtime checks | `MOD 06` and shared `AGT 05` are now unblocked |
| 2026-03-08 | Person B (Ayush) | SCN 01 to SCN 10 | Executed the full scenario-engine prerequisite bundle even though it was assigned to Person A and originally sequenced after `MOD 04` | `SCN 11` and `AGT 01` needed a real normalized scenario generator rather than another placeholder, and the Kian plus Ayush lanes are being covered together | The repo now has deterministic seeded scenario generation for mathematics, machine learning, and finance-trading planning, plus golden fixtures and seeded scenario tests; `SCN 11`, `AGT 01`, and the stub server scenario list are now backed by the same normalized scenario pack | `MOD 04` still needs to thread the normalized scenario pack through `EpisodeState` and replay models cleanly |
| 2026-03-08 | Person B (Ayush) | Architecture roadmap | Shifted the planning docs from lab-first replication toward a normalized multi-domain scenario layer with mathematics and machine learning first, finance and trading planning third, and physics or biology later | The team wants the environment to stay domain-agnostic under a stable outer contract while keeping the reward deterministic and making the Lab Manager stronger for the hackathon story | The source-of-truth backlog, README, and Kian or Ayush planning docs now assume `scenario adapter -> normalized scenario pack -> observation mapper -> stable contracts`, plus a hybrid Lab Manager with deterministic feasibility grounding | `SCN 02`, `SCN 07`, `SCN 08`, `AGT 01`, `AGT 05`, `AGT 07`, and the judge wording must now be implemented to this architecture |
| 2026-03-08 | Person B (Ayush) | FND 03 and FND 12 | Imported the frontend shell and Vite proxy config from Kush's branch even though both tasks are assigned to Max | The `ayush` integration branch only had the frontend scaffold, and the validated frontend from `origin/Kush` needed to exist on the integration branch for future UI and deployment work | `frontend/` now contains the full React plus Vite app, `frontend/vite.config.ts` is present with API and WebSocket proxy rules, and local validation passed with `npm --prefix frontend install` plus `npm --prefix frontend run build` | `FND 13` and `UI 01` are now unblocked; remaining UI tasks still need explicit review before being marked complete |
| 2026-03-08 | Person B (Ayush) | Capability scope and backlog | Expanded the MVP from pure constrained negotiation to bounded evidence-backed research planning with scoped search, code-check, and image-inspection capability, while explicitly excluding audio and unrestricted live web in training | The team decided that research applicability requires richer capabilities, but the hackathon still needs a deterministic RL story with bounded tools and reproducible rewards | The source-of-truth backlog now treats richer capabilities as an additive layer below the frozen outer contract; completed schema and agent work stays valid, while pending prompt, judge, environment, API, and training tasks now absorb bounded tool and evidence-pack support | Keep live web mostly for demo or eval validation, and keep frozen evidence packs as the default training path |
| 2026-03-07 | Person B (Ayush) | AGT 03 | Backlog showed "Not started" but the implementation (parse-and-retry loop with telemetry) already existed from a prior commit | The code and 7 tests were committed earlier but the tracker was never updated | Synced both `ReplicaLab_Comprehensive_Task_Division.md` and `docs/completion.md` to reflect completed status | None |
| 2026-03-07 | Person B (Ayush) | AGT 08 | Expanded scope from test-only to tests plus a bounded-tool policy prompt patch in `build_scientist_system_prompt()` | The acceptance criteria required testing bounded-tool policy reminders, but no tool-policy text existed in the prompt yet; user directed adding the prompt text alongside the tests | Added policy block for `search_evidence`, `run_code_check`, and `inspect_image` to the system prompt; wrote 24 new tests covering parser, prompt, formatter, baseline, and bounded-tool policy; all 111 tests pass | None |
| 2026-03-08 | Person B (Ayush) | ENV 01 | Executed the task even though it was assigned to Person A | The real environment class was still missing, but the server now switches to `ReplicaLabEnv` on successful import, so a working drop-in module was needed before environment and API work could safely proceed | Added `replicalab/env/replicalab_env.py` and `replicalab/env/__init__.py` as a working drop-in replacement for the former in-server stub, verified direct `reset() -> step() -> state() -> close()` behavior, and confirmed the full test suite stays green at `111 passed` | `ENV 02` and `ENV 08` are now unblocked, and the server can instantiate the real env class instead of the fallback stub |
| 2026-03-08 | Person B (Ayush) | JDG 01, JDG 02, JDG 03 | Executed three scoring tasks assigned to Person A | The judge scoring chain was the next critical-path blocker: JDG 04 (total reward formula) depends on all three, and ENV 06 (reward integration) depends on JDG 05, which depends on JDG 04 | Added `replicalab/scoring/rigor.py` (weighted structural completeness, success criteria coverage, required element coverage), `replicalab/scoring/feasibility.py` (7-dimension partial-credit scorer wrapping AGT 05 feasibility checker), `replicalab/scoring/fidelity.py` (substitution-aware hidden-reference adherence scorer), shared `replicalab/utils/text.py` (token extraction and label normalization), `replicalab/scoring/__init__.py` (exports), and `tests/test_reward.py` (18 tests covering ordering, determinism, partial credit, domain range, and cross-scorer consistency); all 134 tests pass | JDG 04 is now unblocked; tracker docs were synced separately |
| 2026-03-08 | Person B (Ayush) | ENV 02, ENV 03, ENV 04, ENV 05, ENV 06, ENV 07, ENV 08, JDG 04, JDG 05, TST 01, TST 02, TST 03 | Executed the full environment chain and rubric tasks assigned to Person A | The environment needed real scenario wiring, validation, grounded Lab Manager responses, centralized termination, judge-computed rewards, deep state snapshots, and close lifecycle guards; the rubric needed the total reward formula and breakdown builder; and the test suite needed reset, step, and invalid-action coverage | Rewrote `replicalab/env/replicalab_env.py` (ENV 02-08: scenario-pack-backed observations, protocol validation, grounded LM pipeline, accept-or-max-rounds termination, real judge scoring via rubric, deep state copies, closed-env guard), created `replicalab/scoring/rubric.py` (JDG 04-05: `compute_total_reward` with `10 × r × f × fi + bonuses − penalties`, `build_reward_breakdown` composing all three sub-scores with efficiency bonus), updated `replicalab/scoring/__init__.py` exports, and created `tests/test_env.py` (TST 01-03: 32 tests covering reset, step, invalid action, state snapshot, close/reopen, and rubric); all 166 tests pass | JDG 06, JDG 08, ENV 10, ENV 11, TST 04, TST 05 are now unblocked; partial server tasks (API 02, 03, 06, 07) can now wire against the real env |
| 2026-03-07 | Person B (Ayush) | JDG 04, JDG 05, ENV 06 finalization | Refined the draft implementations to match final acceptance criteria | JDG 04 needed a zero-clamp floor and JDG 05 needed a named-penalty extension point for bounded-tool diagnostics; ENV 06 needed to distinguish timeout from no-agreement verdicts | `compute_total_reward` now clamps at 0.0; `build_reward_breakdown` accepts optional `penalties: dict[str, float]` for named penalty keys like `invalid_tool_use` and `unsupported_claim`; terminal-without-agreement path now returns `timeout` when max rounds reached vs `no_agreement` otherwise; added 8 new tests in `test_reward.py` and 4 new tests in `test_env.py`; 178 tests pass across the full suite | None |
| 2026-03-07 | Person B (Ayush) | API 03 | Completed the `POST /step` endpoint task assigned to Person C by fixing stale replay logging and adding endpoint tests | The `_build_episode_log()` helper still hardcoded stub audit notes, rebuilt `RewardBreakdown` from state, and used `accept`/`revise` instead of the real `timeout`/`no_agreement` verdicts; both REST and WebSocket terminal paths used the stale helper; and no `/step` endpoint tests existed | Updated `_build_episode_log()` to accept the terminal `StepResult` and use its real `reward_breakdown`, `judge_notes`, and `verdict`; updated both REST `/step` and WebSocket step completion paths to pass the result; fixed `_StubEnv` reference to removed helper; added five endpoint tests covering happy path, invalid session 404, terminal real reward breakdown, semantic invalid action as 200 with `info.error`, and replay with real judge data; all 183 tests pass | API 14 and API 18 are now closer to completion; TST 06 is partially covered by the new tests |
| 2026-03-08 | Person B (Ayush) | API 06 and TST 07 | Executed the WebSocket session handler task and its test task even though both were assigned to Person C | The WebSocket handler already existed in `server/app.py` but had no test coverage, and completing `API 06` was needed to unblock `TRN 03` and `TRN 13` in Person B's own lane | Added 12 WebSocket tests covering connectivity, message handling, error paths, session isolation, semantic-vs-transport error distinction, timeout verdict with real-env integration, and terminal episode replay persistence via `GET /replay/{episode_id}`; all 195 tests pass | `TRN 03` and `TRN 13` are now the next Person B tasks |
| 2026-03-08 | Person B (Ayush) | API 13 | Executed the task even though it was assigned to Person C | The CORS middleware already existed in `server/app.py`, but the task was still partial because frontend-origin verification had not been made explicit | Added three server tests covering localhost Vite preflight, Hugging Face Space origin preflight, and disallowed-origin rejection; `API 13` is now recorded complete in the source of truth and owner trackers | `API 02`, `API 04`, `API 07`, `API 08`, `API 14`, and `OBS 02` remain in Max's active lane |
| 2026-03-08 | Person B (Ayush) | API 04 | Executed the task even though it was assigned to Person C | The `/scenarios` endpoint and its focused tests already met the acceptance criteria, but the task was still marked partial in the trackers | Recorded `API 04` complete in the source of truth and owner trackers based on the existing typed response model, normalized family list, and five dedicated endpoint tests | `API 07`, `API 08`, `API 14`, and `OBS 02` remain in Max's active lane |
| 2026-03-08 | Person B (Ayush) | API 02 | Completed the `POST /reset` endpoint verification and test closure even though the task was assigned to Person C | The endpoint already worked against the real env via `_make_env()` but had no dedicated test coverage and was still marked partial in the tracker | Added seven dedicated `/reset` endpoint tests covering response shape, both-role observation, explicit session_id reuse with prior-env close, default params, all scenario and difficulty combos, and seed determinism; all 202 tests pass; `API 14` and `UI 06` are now closer to completion | None |
| 2026-03-08 | Person B (Ayush) | TRN 13 | Implemented `replicalab/client.py` as specified in the task backlog | `API 06` was complete and `TRN 13` was the next unblocked Person B task | Created `ReplicaLabClient` with dual-transport support (REST via `httpx`, WebSocket via `websocket-client`), unified sync interface (`connect`, `reset`, `step`, `state`, `close`), context manager, internal session tracking, typed Pydantic returns, and 24 tests covering both transports; all 231 tests pass | `TRN 03` is now the next unblocked Person B task |
| 2026-03-08 | Person B (Ayush) | API 07 | Completed the WebSocket idle-timeout and graceful-disconnect verification even though the task was assigned to Person C | The idle-timeout logic and `finally: env.close()` path already existed in `server/app.py`, but the task was still partial because resource-cleanup verification had not been made explicit | Added two focused WebSocket tests covering idle timeout close code `1000` and exactly-once `env.close()` on disconnect; `API 07` is now recorded complete in the source of truth and owner trackers | `API 08`, `API 14`, and `OBS 02` remain in Max's active lane |
| 2026-03-08 | Person B (Ayush) | API 08 | Completed the Docker build and run verification even though the task was assigned to Person C | The Dockerfile existed but had never been verified end to end; editable install failed inside Docker, and `httpx` plus `websocket-client` were missing from `server/requirements.txt` | Changed `pip install -e .` to `pip install .` in both `server/Dockerfile` and root `Dockerfile`; added `httpx` and `websocket-client` to `server/requirements.txt`; rebuilt without cache; verified the container starts with `"env":"real"` and all four endpoints (`/health`, `/scenarios`, `/reset`, `/step`) respond correctly; added verified endpoint commands to `docs/max/deployment.md` | `API 09` and `API 16` are now unblocked |
| 2026-03-08 | Person B (Ayush) | Recovery sync, API 09, API 15, TST 04, TST 05 | Recovered the lost env, server, client, and test bundle from unreachable git objects and re-synced the deployment and testing trackers to the validated repo state | The branch had rolled back to `5538ba0`, which left the working code, deployment metadata, and tracker files out of sync even though the recovered code now passes 231 tests, Docker validation, and OpenEnv validation | Restored the missing runtime files, revalidated the real env and Docker path, recorded the HF Space metadata tasks (`API 09`, `API 15`) as complete, and closed the two reward-regression tests (`TST 04`, `TST 05`) that are already covered in `tests/test_reward.py` | Live HF Space bring-up remains `API 10` |
| 2026-03-08 | Person B (Ayush) | JDG 08 | Executed the task even though it was assigned to Person A | The judge stack needed stronger regression coverage before parallel training and deployment work fan out, and the current reward tests did not yet cover the most important ordering and edge-case scenarios explicitly | Added five focused `tests/test_reward.py` regressions covering good-vs-awful ordering across all judge axes and total reward, success-criteria sensitivity for rigor, partial equipment credit for feasibility, direct-match vs substitution vs miss ordering for fidelity, and reward-breakdown determinism with and without a precomputed feasibility check; full suite now passes at 264 tests | `JDG 06`, `AGT 09`, `SCN 13`, and `ENV 10` remain the next Kian-lane tasks |
| 2026-03-08 | Person B (Ayush) | MOD 06 | Completed the semantic impossibility validators even though the task was assigned to Person A | The dependency `MOD 05` was complete and the validators extend the same `validate_protocol()` function | Added `_check_semantic_impossibilities()` with five checks (zero sample with controls, controls >= sample size, duplicate controls/equipment/reagents) and seven new tests; all 223 non-live-server tests pass; valid protocols remain unaffected | `MOD 08` (unit tests for schemas and validators) is partially unblocked |
| 2026-03-08 | Person B (Ayush) | JDG 06 | Implemented the plain-English judge explanation layer even though the task was assigned to Person A | `JDG 05` was complete, the explanation function was fully deterministic and isolated, and finishing it immediately unblocked Ayush's `AGT 10` prompt-file task | Added `replicalab/scoring/explain.py`, exported `explain_reward(...)` through `replicalab.scoring`, and covered it with nine focused reward tests without changing any scoring math | `AGT 10` is now unblocked; `JDG 11` can now package the explanation into the final audit payload |
| 2026-03-08 | Person B (Ayush) | JDG 11 | Implemented the structured final audit payload even though the task was assigned to Person A | Both dependencies (`JDG 05`, `JDG 06`) were complete, and the audit builder is a pure deterministic formatter with no scoring changes | Created `replicalab/agents/judge_policy.py` with `JudgeAudit` model and `build_judge_audit()` builder; derives verdict, reuses `explain_reward()` for notes, extracts top failure reasons from weak components and penalty keys; exported through `replicalab.agents`; ten tests pass; 255 full suite passes | `ENV 11`, `UI 13`, and `OBS 09` are now unblocked |
| 2026-03-08 | Person B (Ayush) | SCN 13 and AGT 09 | Executed two Person A tasks to keep the Kian lane consistent with the implemented repo state | `SCN 13` was already implemented in the scenario layer and `AGT 09` was already implemented as deterministic Lab Manager regression coverage, but both were still left open in the tracker flow | Recorded `SCN 13` complete in the normalized scenario layer and `AGT 09` complete in the Lab Manager grounding test stack, bringing the source-of-truth backlog, completion rollup, and Kian owner docs back into sync with code and tests | `ENV 10` and `ENV 11` are now the remaining unblocked Kian-lane tasks |
| 2026-03-08 | Person B (Ayush) | ENV 11 | Finished the env-side audit integration on Person A's lane and closed the replay-state gap | The env already attached `judge_notes` and `verdict` to terminal `StepResult` and `EpisodeState`, but replay logs were still dropping `top_failure_reasons`, so the task was only partially complete against its own acceptance text | Added `top_failure_reasons` to the replay `EpisodeLog` build path in `server/app.py`, kept the canonical env audit source in `replicalab/env/replicalab_env.py`, and verified terminal audit payload survival through env tests and replay endpoint tests | `ENV 11` is now fully closed; Kian's only fully unblocked task is `ENV 10`, while `API 18` and `OBS 09` are each one dependency closer |
| 2026-03-08 | Person B (Ayush) | ENV 10 | Executed the deterministic replay and broader environment regression suite even though the task was assigned to Person A | The environment lifecycle and audit stack were complete, but the repo still needed proof that same seed plus same action sequence yields the same trajectory and final state across all supported families without depending on file-backed replay persistence | Added replay-determinism coverage to `tests/test_env.py` for same-seed initial observations, same-seed same-action trajectories, timeout determinism, invalid-action determinism, and terminal audit replay stability across math, ML, and finance families; full suite now passes at 327 tests | `OBS 04` is now unblocked, while `MOD 08` still waits on `MOD 07` |
| 2026-03-08 | Person B (Ayush) | OBS 04 | Closed the replay-observability test task on Person A's lane using the new deterministic env replay suite | `OBS 04` depends on `ENV 10`, and the completed `TestReplayDeterminism` block already proves same-seed same-action replay consistency across the full environment stack, so leaving the task open would only create tracker drift | Recorded `OBS 04` complete against the existing `tests/test_env.py` replay determinism coverage without adding redundant second-copy tests; the observability lane now has its deterministic replay guard in the env test suite | Kian has no fully unblocked implementation task left; `MOD 08` still waits on `MOD 07` |
| 2026-03-08 | Person B (Ayush) | AGT 10 | Implemented the role prompt files and loader helpers in code after the deterministic judge explanation layer landed | `AGT 10` was unblocked by `JDG 06`, and keeping the prompt source in versioned files was cleaner than scattering role text across notebook cells or inline string literals | Added `replicalab/prompts/scientist.txt`, `lab_manager.txt`, and `judge.txt` plus rendering helpers in `replicalab/prompts/__init__.py`, with six tests covering loadability, placeholder rendering, and bounded-tool rules | The role prompt bundle is now stable for notebooks, demos, and later model calls |
| 2026-03-08 | Person B (Ayush) | TRN 04 | Implemented the rollout collection loop as a reusable Python module rather than only inside a notebook | The backlog labels `TRN 04` as notebook work, but implementing it in `replicalab/training/rollout.py` makes the same rollout logic reusable across notebooks, tests, and future trainer code while preserving the required behavior | Extended `RolloutWorker` with terminal `StepInfo`, bounded tool trace aggregation, and `collect_rollouts(...)`; added trace and batch tests in `tests/test_rollout_traces.py` and kept the rollout logic fully testable outside a notebook | `TRN 05` is now unblocked and notebooks can import the rollout loop instead of reimplementing it |
| 2026-03-08 | Person B (Ayush) | API 14 | Completed the REST session isolation verification even though the task was assigned to Person C | The session isolation logic already worked correctly in `server/app.py`; the task was still marked partial because no dedicated tests proved concurrent-user isolation against the real env | Created `tests/test_api_rest_isolation.py` with 11 tests covering session independence, round-count isolation, terminal isolation, session_id reuse, invalid session handling, and replay isolation; no server changes needed; 307 tests pass | No new dependencies unblocked; `API 14` was the last partial API task besides `API 01` and `OBS 02` |
| 2026-03-08 | Person B (Ayush) | MOD 07 and MOD 10 | Closed the replay-persistence and schema-example tasks on Max's lane after verifying the code that had already landed | `replicalab/utils/logging.py` and the API example generator were implemented and passing tests, but the source-of-truth backlog and Max's owner docs still showed both tasks as not started, and the generated examples still contained stale stub audit text | Updated `tests/fixtures/generate_api_examples.py` to derive terminal judge metadata from the current deterministic judge helpers, regenerated `api_schema_examples.json`, and synced `MOD 07`/`MOD 10` to complete in the comprehensive backlog, completion rollup, and Max owner docs | `MOD 08` and `JDG 07` are now clearly unblocked in the tracked plan |
| 2026-03-08 | Person B (Ayush) | Reward shaping and rubric refinement | Expanded the reward system beyond terminal-only scoring without reopening the outer action or observation contract | Sparse terminal-only reward was too weak for RL training, and the project needed deterministic shaping rather than a frontier-model reward source | Added a parsimony term to terminal reward, introduced deterministic step shaping in `ReplicaLabEnv` (information gain, protocol delta, momentum, contradiction, hallucination, stalling, regression, invalid-action, timeout, and no-agreement signals), updated rollout aggregation to use cumulative episode reward, and aligned env/server tests to the new shaped-reward semantics while keeping the full suite green at 356 tests | Keep the notebook and training plots explicit about terminal reward components vs cumulative shaped episode reward |
| 2026-03-08 | Person B (Ayush) | Oracle hybrid architecture | Added an Oracle-style frontier-model layer as an additive integration instead of replacing the deterministic environment and reward stack | The sponsor-facing V2 direction calls for a model-driven intelligence layer woven through scenario generation, environment interaction, and explanation, but the RL training path still needs deterministic reward and reproducible evaluation | Added `oracle_models.py`, `oracle.py`, `cache.py`, Oracle prompt assets, an optional model-backed Lab Manager wrapper, an adapter from Oracle scenarios into the existing normalized scenario pack, and feature-flagged Oracle hooks in `ReplicaLabEnv`; kept deterministic scoring in `replicalab/scoring/*` as the canonical training reward; expanded test coverage with `test_oracle.py`, `test_cache.py`, and Oracle adapter/prompt tests; full suite now passes at 365 tests | If this grows beyond the current additive mode, record any future contract or reward-source changes separately before altering the deterministic training path |
| 2026-03-08 | Person B (Ayush) | Deployment access tooling | Added Northflank CLI installation verification and service-operation commands to `docs/max/deployment.md` even though the original deployment docs were HF-Space-centric | The active service now also needs a documented Northflank access path for forwarding, logs, shell access, and file transfer | Backend deployment docs now include the verified local CLI install (`northflank` 0.10.16), login command shape, and the `replica-labs` / `replicalab-ai` service commands | Actual login still requires a user-supplied account token outside the repo |
| 2026-03-08 | Person B (Ayush) | Local paper corpus for training and experiment design | Added a new local dataset under `data/papers/` sourced from `ReplicaLab_50_Scenarios_Training_Plan.md`, which is outside the original tracked backlog artifacts | The training-plan draft now calls for a 50-paper corpus to support experiment-design grounding, but many scenario titles are synthetic summaries rather than directly retrievable publication titles | Downloaded 50 open-access PDFs into `data/papers/<field>/<paper-name>/`, added per-paper metadata plus `data/papers/manifest.json`, and marked substitute papers explicitly when the exact scenario title could not be matched cleanly | If the team wants this corpus versioned in git or refreshed later, keep using the manifest as the source of truth for replacements and provenance |
| 2026-03-08 | Person B (Ayush) | MOD 08 | Completed the comprehensive schema and validator unit test task on Person A's lane | All MOD 01–07 dependencies were complete, and the task was the last remaining item in Kian's backlog | Created `tests/test_mod08_schemas.py` with 70 unit tests covering all Pydantic model edge cases across 11 test classes (ScientistAction, LabManagerAction, Protocol, ConversationEntry, RewardBreakdown, Observation, LabManagerObservation, StepInfo, StepResult, EpisodeState, EpisodeLog); full suite passes at 409 tests | Kian's lane is now 100% complete (49/49 tasks) |
| 2026-03-08 | Person B (Ayush) | JDG 07 | Closed the reward-breakdown logging task on Max's lane after verifying the implementation already meets all acceptance criteria | `append_reward_csv()`, `append_reward_jsonl()`, and `log_episode_reward()` were already implemented in `replicalab/utils/logging.py` with 22 tests in `tests/test_logging.py`; no code changes needed | Verified CSV column set (parsimony, bonuses, penalty total, verdict), JSONL nested penalty/bounded-tool preservation, determinism, and the dual-format convenience wrapper; marked JDG 07 complete in all three tracker files | `ENV 09` and `JDG 10` are now unblocked |
| 2026-03-08 | Person B (Ayush) | API 01 and OBS 02 | Closed the two remaining partial tasks on Max's lane after verifying both already exceed their acceptance criteria | API 01's health endpoint, full REST/WS server, and 34+11 endpoint tests were already passing; OBS 02's env-var log toggle and readable format were already wired in `config.py` and `server/app.py` | Verified and marked both tasks complete; no active partial tasks remain in the project | Max's next unblocked chain is `ENV 09 -> OBS 01 -> API 05` |
| 2026-03-08 | Person B (Ayush) | V2 training architecture | Implemented the training stack as reusable Python modules plus Northflank-friendly job entrypoints instead of keeping the work notebook-only | The active runtime direction changed to Northflank H100 with persistent volumes, two first-class model artifacts, and a judged notebook that should stay thin and readable | Added `replicalab/training/{artifacts,corpus,datasets,runtime,scientist_grpo,lab_manager_sft,evaluation,metrics,plots,cli}.py`, added `replicalab-train` as a package script, created `notebooks/train_colab.ipynb` as the driver notebook, and added focused training tests | Remaining work is real-run validation (`TRN 05`), notebook-facing metric finalization (`JDG 10`, `TRN 06`), and trained-adapter evaluation wiring (`TRN 08`, `TRN 09`, `TRN 15`) |
| 2026-03-08 | Person B (Ayush) | ENV 09, OBS 01, OBS 03, OBS 07, OBS 09, API 05, API 11, API 18, TST 06, TST 11 | Executed ten Person C (Max) tasks as a batch to close out the logging, replay, observability, API endpoint, and testing gaps | Max's remaining backend chain was blocking downstream UI, notebook, and submission tasks, and Person B had already implemented most of the underlying code in prior commits | ENV 09: added `write_episode_log()` and `log_episode_reward()` calls to REST and WS step handlers for auto-persisting replay JSON and reward CSV/JSONL. OBS 09: added `invalid_action_count` and `invalid_action_rate` fields to `EpisodeLog`. OBS 07: created `scripts/run_episode.py` for one-command local episode dumps. TST 11: created `tests/test_audit_contract.py` with 17 contract tests. API 05, API 11, API 18, OBS 01, OBS 03, TST 06: verified already-implemented code against acceptance criteria and recorded as complete | Max's remaining tasks are `API 16`, `API 19`, `DOC 08`, and `UI 11` |
| 2026-03-08 | Person B (Ayush) | API 19 | Implemented the OpenEnv `/web` fallback route on Person C's lane | All dependencies (`FND 09`, `API 08`, `API 10`) were complete; the fallback was needed for demo resilience when the custom React UI is unavailable | Added a self-contained HTML/JS `/web` endpoint to `server/app.py` with interactive reset/propose/accept controls, scenario/seed/difficulty selection, negotiation log, score display, and raw response viewer; added `web_fallback: /web` to `openenv.yaml`; added 3 tests in `test_server.py`; 474 tests pass | Max's remaining tasks are `API 16`, `DOC 08`, and `UI 11` (all blocked on Kush frontend work) |
| 2026-03-08 | Person D (Kush) | UI 07 | Completed the REST plus WebSocket client helpers task | Kush pushed a full `frontend/src/lib/api.ts` rewrite with REST helpers (`healthCheck`, `resetEpisode`, `stepEpisode`, `getReplay`), WebSocket support (`createWebSocket`, `sendWsMessage`), backend-to-frontend type adapters, and default action builders | `UI 07` is now complete; `UI 11` is unblocked on this dependency | `UI 11` can now proceed once the integration is wired |
| 2026-03-08 | Person D (Kush) | API 16, UI 10, UI 11 | Completed frontend integration, styling, and Docker multi-stage build | Kush pushed a multi-stage Dockerfile (Node frontend build copied into the Python runtime), SPA static serving in `server/app.py`, and new frontend components (ProtocolEditor, AutoPlayControls, LiveScoreGauges, LabScene3D, AgentThoughts, EpisodeComparison, Onboarding, KeyboardShortcuts, Toast, confetti) | All three tasks complete; Max's lane reduced to `DOC 08` only | `DOC 08` is the last remaining Max task |
| 2026-03-08 | Person B (Ayush) | DOC 08 | Verified repo hygiene on Person C's lane | All dependencies (`API 10`, `UI 10`, `TRN 10`) were now complete | Verified the repo is public (`isPrivate: false`), `.env` is untracked, no API-key patterns appear in tracked files, `.gitignore` covers `.env`, and all required files exist (code, models, env, scoring, agents, server, frontend, Docker, tests, notebook, scripts, docs) | Max (Person C) is now 100% complete (41/41 tasks) |
| 2026-03-08 | Person B (Ayush) | ART/OpenEnv training runtime | Switched the active live RL execution path from the planned Northflank-heavy route to the already-working ART/OpenEnv serverless route for immediate training validation | The Northflank H100 job shape was documented and scaffolded, but the fastest path to real rollouts and trainer execution was the hosted ReplicaLab + OpenPipe ART integration that could be exercised immediately | Added `art-scientist-train`, live smoke runs, comparison-eval runs, run metadata, plots, evidence manifests, and process documentation; the training pipeline is now validated end to end against the live environment | Keep Northflank as the future heavy-run backend once the dedicated GPU job image and volume flow are ready |
| 2026-03-08 | Person B (Ayush) | TST 09 | Marked the notebook smoke-test task complete before `TRN 12` because the checklist and runtime validation are technical work, while `TRN 12` is a storytelling task | The smoke checklist was already written, and it was then executed end to end with fresh-runtime preview, live ART/OpenEnv training, and comparison-eval commands against frozen evidence packs | `TST 09` is now complete; Ayush's lane is fully closed, while Person D still owns the plain-English result bullets in `TRN 12` | Continue using the smoke checklist as the canonical fresh-runtime validation path for the judged notebook |
| 2026-03-08 | Person B (Ayush) | Frozen evidence-pack loading | Added a plan-derived fallback when the local `data/papers/manifest.json` corpus is absent | The paper corpus is intentionally not committed, but fresh-runtime training preview and test paths still need stable evidence packs instead of crashing on a missing manifest file | `replicalab/training/corpus.py` now synthesizes deterministic `plan_only` manifest entries from the 50-scenario training plan whenever the local paper manifest is missing; fresh-runtime preview, tests, and smoke commands now work without the local PDF corpus | Keep using the real local corpus when available; treat the plan-only path as a portability fallback, not the preferred evaluation corpus |
| 2026-03-08 | Person B (Ayush) | Minimal Colab sponsor asset | Added an explicit minimal Colab training notebook in addition to the fuller judged notebook | The hackathon requirement calls for a minimal Unsloth or HF TRL Colab script, and the repo previously only had the broader multi-step notebook plus a placeholder minimal file | `notebooks/train_minimal_colab.ipynb` now contains a real minimal Unsloth + HF TRL GRPO flow for ReplicaLab, and `tests/test_notebooks.py` guards that both notebook assets keep their intended roles | Keep the minimal notebook tiny and sponsor-facing; keep complex workflow details in `notebooks/train_colab.ipynb` |
| 2026-03-08 | Person B (Ayush) | Person D batch close-out: DOC 01–07, DOC 09–11, SCN 12, TRN 12, API 12, UI 01–06, UI 08–09, UI 12–15, FND 13, JDG 09, OBS 05, OBS 08, TST 08, TST 10, TST 12 | Closed the 28 remaining Person D tasks in one batch to reach 152/152 (100%) | Kush had already built the full React frontend (14 of 15 UI tasks), and the doc/storytelling tasks were text work that could be completed from the existing README, demo script, recording guide, and smoke checklist | Enhanced the README with a replication-crisis hook (DOC 01), a 4-option setup (DOC 03), key takeaways (DOC 04/TRN 12), the `/web` fallback route (DOC 11), and aligned scenario summaries (SCN 12). Created docs/submission_prep.md (DOC 09) and docs/pitch_outline.md (DOC 10). Verified Kush's frontend components against acceptance criteria for all UI tasks. Credited existing docs (demo_script.md, recording_guide.md, ui_smoke_checklist.md) toward DOC 05–07, UI 12, TST 08, and TST 12 | Project is now 100% complete across all 12 epics and 4 team members |
| 2026-03-08 | Person B (Ayush) | Frontend demo narrative refinement | Executed an additional frontend storytelling pass on Person D's lane after the backlog was already marked complete | The hackathon demo needs the UI to tell the paper-to-training story immediately, and the imported frontend still read as a generic episode runner in several places | Reframed the dashboard, episode page, paper panel, controls, training panel, and compare page around `source paper -> parsed brief -> negotiation -> deterministic judge -> training`, fixed strict TypeScript issues in imported UI components, refreshed `frontend/package-lock.json`, and verified the production build with `npm --prefix frontend run build` | Swap the packaged training-demo trace for live artifact data if a final run is ready before recording |
| 2026-03-08 | Person B (Ayush) | Frontend live episode policy | Adjusted the frontend auto-step action builder after local stack verification exposed a mismatch with backend baseline behavior | The demo UI was using a hard-coded generic proposal that could fail validation immediately on real scenarios, even though the backend and evaluator produced valid baseline runs | Made the frontend default Scientist proposal scenario-aware using live episode context (time limit, available resources, scenario family), rebuilt the frontend, and re-verified that a local ML episode now reaches a valid judged terminal result | If final recording depends on exact baseline numbers, keep using the local evaluation artifacts or wire the UI directly to saved summaries rather than relying on synthetic cards |
| 2026-03-08 | Person B (Ayush) | Frontend episode kickoff UX | Added explicit first-round call-to-action controls to the live episode view after user testing showed the page looked stuck immediately after reset | The reset state loaded the paper and constraints correctly, but both the step controls and protocol-editor submit action could sit below the fold, making the UI appear frozen at round 0 | Added an `Episode ready` banner plus in-panel `Advance First Round` and `Open Protocol Editor` actions, and updated the negotiation placeholder so it no longer says `Start an episode` after an episode is already active | Keep a visible first action near the negotiation panel in future layout changes |
| 2026-03-08 | Person B (Ayush) | One-click live demo automation | Extended the hero CTA flow so `Replicate a Paper` runs the seeded episode automatically instead of only opening the episode page | The hackathon demo needs a one-click narrative with live agent behavior and judge output; requiring manual reset and step clicks after entering the episode page weakens the demo | The dashboard now links to a seeded demo URL, the episode page auto-starts on demo routes, preserves demo query params in the shareable URL, and enables autoplay so the negotiation proceeds to judged completion with no extra clicks | If this is later generalized beyond the hero CTA, keep manual scenario-card entry as a separate non-demo path |
| 2026-03-08 | Person B (Ayush) | Frontend backend-availability diagnostics | Added a startup health check and contextual network-error messaging to the replication setup flow after the live demo surfaced a generic `Failed to fetch` banner | The page gave no actionable explanation when the local API server on port `7860` was not running, which made a recoverable environment issue look like an application bug | `Controls.tsx` now checks backend health on load, `frontend/src/lib/api.ts` rewrites fetch-network failures into an explicit uvicorn startup instruction, the frontend was rebuilt, and the local FastAPI server was restarted and verified healthy on `127.0.0.1:7860` | Keep the backend running during demo prep or use the integrated backend-served frontend on `http://127.0.0.1:7860` |
| 2026-03-08 | Person B (Ayush) | Frontend live demo outcomes and results report | Extended the one-click demo into three seeded story modes with a detailed post-episode report instead of a single autoplay path that stopped at the generic episode view | The hackathon demo needs to show distinct judged outcomes: immediate agreement, multi-round learning opportunity, and failure to reach agreement, all backed by real episode values rather than static mock copy | Added `fast-agreement`, `learning-opportunity`, and `no-agreement` dashboard launches; routed episode autoplay through scripted but backend-valid action sequences; added a results report with live reward charts, terminal score bars, training interpretation, reliability labeling, and tool-install suggestions; rebuilt the frontend and verified all three seeded ML runs against the live backend | If Oracle-backed narrative summaries are added later, keep the deterministic judge verdict and real score traces as the source of truth for the report |
| 2026-03-08 | Person B (Ayush) | Frontend training-status reporting | Replaced the packaged training teaser with an artifact-backed training page and honest improvement guidance | The demo needs a place to show training logs, real achieved values, and whether more training is still required; the earlier dashboard card implied progress but did not expose the real run outputs | Added a dedicated `/training` route, header navigation, a shared frontend data module sourced from real run summaries, and a new training page with checkpoint charts, compare bars, log highlights, preview-artifact status, and explicit `needs more training` analysis; rebuilt the frontend and reverified backend serving on `/training` | If future runs improve beyond baseline, update the shared training artifact data module first so the dashboard and training page stay consistent |
| 2026-03-08 | Person B (Ayush) | Demo video generation | Added a reproducible local builder for the one-minute demo video instead of relying on manual screen recording only | The demo now needs a fast, repeatable way to regenerate the final video with current UI states, a fresh voiceover, and ffmpeg assembly whenever the frontend story changes | Added `scripts/build_demo_video.py` plus `docs/demo_video_script_60s.md`; the script reads the ElevenLabs key from `.env`, captures the real dashboard, episode, and training screens with Selenium, synthesizes the voiceover, writes subtitles, and builds `replicalab/outputs/demo_video/replicalab_demo_60s.mp4` | If the narration or demo scenes change, rerun `python scripts/build_demo_video.py` to regenerate the assets from the current app state |
| 2026-03-08 | Person B (Ayush) | Hugging Face Space deployment | Redeployed the live HF Space from the current local app state after the hosted URL was serving an old backend-only container | The Space repo and runtime SHAs had drifted behind the local `master` branch, so the public URL showed the API landing page instead of the React app even though the repo already contained the multi-stage Docker build and SPA-serving server code | Synced the deployment files to `ayushozha/replicalab` through the Hugging Face API, restarted the Space, and verified that `https://ayushozha-replicalab.hf.space/` now serves the built frontend while `/health` still reports the real environment | If the Space serves the API-only page again, compare the Space repo SHA and runtime SHA before assuming the frontend build is broken |
| 2026-03-08 | Person B (Ayush) | Frontend policy-results clarification | Added a separate baseline-vs-trained-vs-oracle page and clarified that the current public compare benchmark is still running the deterministic live runtime | The existing `/compare` page looked like a model-policy comparison, but it actually replays seeded benchmark episodes with the default Scientist action builder plus deterministic backend logic, which was confusing for the demo narrative | Added `/policies` as a dedicated policy-results page with live/runtime status, baseline vs trained artifact values, and an explicit oracle-not-mounted status; updated the header navigation and added a runtime clarification callout plus deep link on `/compare` | Keep `/compare` focused on seeded scenario benchmarking and use `/policies` when the audience asks whether the current app is actually running a trained or oracle-backed model |
| 2026-03-08 | Person B (Ayush) | Localhost model-driven Scientist runtime | Added a backend-selected Scientist runtime path for localhost episodes and switched the live local mode from the blocked Anthropic path to Ollama | The repo needed a real localhost model-driven flow rather than the frontend default action builder, but the current Anthropic account cannot make live API calls because its credit balance is exhausted | Added `/runtime` and `/agent-step`, wired Anthropic and Ollama Scientist backends, made non-demo episode stepping prefer the backend model path, added a deterministic safety adapter plus baseline fallback for fragile local generations, and verified live localhost stepping with `glm-5:cloud` through Ollama | If Anthropic credits are replenished later, restart the backend with `REPLICALAB_SCIENTIST_RUNTIME=anthropic` to use that path instead |
| 2026-03-08 | Person B (Ayush) | Frontend live-run randomness and judge semantics | Changed the default dashboard live run from one fixed scripted scenario to a random seeded paper episode, and split accepted-with-weaknesses presentation from outright failure presentation | The main demo CTA kept launching the same fixed `fast-agreement` route, which made the product feel canned, and the judge UI was showing `Accept` alongside `Failure Reasons`, which looked contradictory even though the backend semantics were agreement-based | The hero CTA now generates a fresh live route per click, the fixed outcome cards are explicitly labeled scripted, and accepted verdicts with residual gaps now render as `Accept with caveats` / `Conditional` instead of green accept plus red failure messaging | If the team later changes backend verdict semantics, keep the UI wording aligned so agreement and replicability remain separate concepts |
| 2026-03-08 | Person B (Ayush) | Frontend caveat-state consistency | Tightened the remaining frontend success states so caveated accepts no longer behave like clean wins | After the judge-panel wording fix, the stage animation, the first-round `good paper` label, and the completion toast could still celebrate an accepted-but-weak protocol as a full success | `CharacterStage`, `EpisodePage`, and `EpisodeResultsReport` now treat accepted-with-caveats runs as partial outcomes, and live reset checks confirmed the dynamic route surfaces distinct paper briefs across scenario families when using the real reset contract | Keep any future verdict-label changes aligned across audit copy, stage emotion, toasts, and post-episode summaries |
| 2026-03-09 | Person B (Ayush) | Post-MVP training refinement | Shifted the active training iteration from the older `Qwen3-8B` assumption to `Qwen3.5-9B`, added prompt-goal expansion plus paper-understanding and communication metrics, and started persisting cross-run benchmark history plots | Model quality is now the bottleneck, so the next useful work is better training coverage and evaluation signal rather than more plumbing; the user also requested a clearer separation between immediate metric work and a later execution-environment redesign | Scientist and Lab Manager defaults now target `Qwen/Qwen3.5-9B`, eval outputs now track `paper_understanding` and `communication_quality`, shared benchmark history now accumulates under `replicalab/outputs/training/history/`, and `docs/training_goals.md` records the larger execution-env phase as a separate architecture track | Keep the deterministic judge as the reward source; treat any large-model judge such as `Qwen3.5-122B-A10B` as audit-only until an explicit architecture change is approved |
| 2026-03-09 | Person B (Ayush) | Deployment reality check for HF + Northflank | Recorded the current hosted-model and training-launch blockers after verifying the live tokens and remote resources instead of assuming the documented path was still operational | The project docs described HF-heavy hosting and Northflank H100 training as available paths, but the current HF account is not billable and the current Northflank training job is not runnable yet | Verified via live checks that the HF token authenticates but the account reports `canPay=false` with no orgs, that `replicalab-train` returns `409 No deployment configured` when started, and that the live `replicalab-ai` container on `nf-gpu-hack-16-64` does not expose `nvidia-smi` or `/dev/nvidia*` | Before promising heavy-model hosting or H100 training, attach a runnable image to the job, re-probe GPU visibility from inside the runtime, and enable a billing-backed HF account or move serving to another provider |
| 2026-03-09 | Person B (Ayush) | Northflank notebook validation | Validated the separate Northflank notebook service after the original pasted notebook hostname turned out to be stale | The repo previously had an unrunnable training job, but the team also had a live Jupyter route; without checking the actual service, it was unclear whether H100 access existed, whether the notebook credentials worked, and whether the saved training state was usable | Verified the live `notebook-openport/jupyter-pytorch` service, confirmed successful Jupyter login, confirmed in-container `NVIDIA H100 80GB HBM3`, identified the live notebook DNS `app--jupyter-pytorch--9y6g97v7czb9.code.run`, and inspected the saved GRPO outputs/logs showing checkpoints through step 200 followed by a chat-template/content-format failure | Use the notebook as the current heavy-run path only after reconciling its repo state with the main workspace and fixing the `apply_chat_template` message-format bug |
| 2026-03-09 | Person B (Ayush) | H100 paper-understanding benchmark | Shifted the active H100 benchmark from a planned full multi-round rollout sweep to a first-step live environment benchmark on the same notebook | The current notebook image lacks the fast linear-attention path for the saved `unsloth/Qwen3.5-0.8B` adapter, so repeated sharded `scientist-local-compare-eval` attempts stayed active for a long time without producing same-turn artifacts even after retry and token-budget cuts | Produced a merged live H100 benchmark artifact set at `replicalab/outputs/training/h100-one-step-500-20260309/` covering `500` total simulations (`250` shared reset cases × baseline/trained first-step actions); the current saved adapter underperformed badly versus the deterministic baseline on first-step paper understanding and collapsed to `request_info` on every trained sample | If a full multi-round benchmark is still required later, first fix the notebook image to restore the fast attention path or move the eval to a more efficient runtime |
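Several entries in this log (reward shaping, rollout aggregation, notebook plots) distinguish terminal reward components from the cumulative shaped episode reward. A minimal sketch of that aggregation, assuming the shaped semantics described above; all names here are illustrative and are not the project's actual API (the real logic lives in `ReplicaLabEnv` and `replicalab/scoring/*`):

```python
from dataclasses import dataclass, field

# Hypothetical illustration: deterministic per-step shaping signals are
# accumulated across the episode and summed with the terminal reward,
# so training plots can report both quantities separately.

@dataclass
class EpisodeRewardTally:
    step_shaping: list[float] = field(default_factory=list)
    terminal_reward: float = 0.0

    def add_step(self, *, bonuses: float = 0.0, penalties: float = 0.0) -> None:
        # One shaped step: e.g. information-gain bonus minus stalling/invalid penalties.
        self.step_shaping.append(bonuses - penalties)

    def cumulative(self) -> float:
        # Cumulative shaped episode reward = sum of step shaping + terminal score.
        return sum(self.step_shaping) + self.terminal_reward

tally = EpisodeRewardTally()
tally.add_step(bonuses=0.2)                  # informative tool use
tally.add_step(bonuses=0.1, penalties=0.05)  # progress with a minor stall
tally.terminal_reward = 1.0                  # judged terminal score incl. parsimony term
print(round(tally.cumulative(), 3))          # 1.25
```

The point of the split is auditability: the terminal components stay attributable to the deterministic judge while the shaping terms remain visible per step, which is what the follow-up note about notebook and training plots asks for.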