ayushozha committed on
Commit
8a624de
·
1 Parent(s): 81312b4

Complete FND 01 and FND 10, update task division with status tracking


- FND 01: Add repo scaffold with all top-level folders and subfolders
- FND 10: Add replicalab/outputs/ with logs, replays, plots subdirs
- Add status and completed-by columns to all epic task tables
- Add status legend to epic backlog section
- Add Section 4.1 training compute availability (H100)
- Mark FND 01 and FND 10 as completed
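The FND 01 / FND 10 scaffold described above can be sketched in a few shell commands. This is a minimal sketch, assuming the folder names that appear in the task tables (`replicalab/`, `server/`, `frontend/`, `notebooks/`, `tests/` and the `replicalab/` submodules); the exact layout created in this commit is not confirmed.

```shell
# Hedged sketch of the FND 01 repo scaffold -- folder names are taken from
# the task tables; the commit's exact layout is an assumption.
mkdir -p replicalab/env replicalab/agents replicalab/scenarios replicalab/scoring replicalab/utils replicalab/prompts
# FND 10: output directory structure with logs/, replays/, plots/ subdirs
mkdir -p replicalab/outputs/logs replicalab/outputs/replays replicalab/outputs/plots
mkdir -p server frontend notebooks tests
# FND 10 acceptance criterion: generated files are not committed to git
echo 'replicalab/outputs/' >> .gitignore
```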

ReplicaLab_Comprehensive_Task_Division.md CHANGED
@@ -96,6 +96,13 @@ By judging time, the project should demonstrate:
 | Storytelling | everyone contributes screenshots, gifs, examples |
 | Submission readiness | all four review final demo, notebook, README, repo visibility |

 ---

 ## 5. Module and function ownership map
@@ -183,6 +190,14 @@ Every PR must include:

 ## 8. Epic backlog

 ---

 ## Epic E01. Foundations and repository setup
@@ -190,6 +205,17 @@ Every PR must include:
 ### Epic goal
 Create a stable shared codebase, contracts, and development workflow so all workstreams can proceed in parallel.

 ### User stories

 **US E01.1**
@@ -200,21 +226,21 @@ As a team, we want agreed schemas and coding rules so integration risk stays low

 ### Tasks

- | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
- | --- | --- | --- | --- | --- | --- | --- | --- |
- | FND 01 | E01.1 | Person C | repo root | Create repo structure and base folders from agreed layout | none | 0.5h | all top level folders exist and repo clones cleanly |
- | FND 02 | E01.1 | Person C | `pyproject.toml` | Add Python project config and dependencies placeholder | FND 01 | 0.5h | project installs locally without missing package errors for base modules |
- | FND 03 | E01.1 | Person C | `frontend/package.json` | Initialize React plus Vite frontend shell | FND 01 | 0.5h | `npm install` and dev server run successfully |
- | FND 04 | E01.2 | Person A | `replicalab/models.py` | Add empty Pydantic models and shared type names | FND 01 | 0.5h | import paths resolve for all placeholder models |
- | FND 05 | E01.2 | Person C | `.gitignore` and `.dockerignore` | Add ignore rules for Python, Node, logs, notebooks, and build artifacts. `.dockerignore` must explicitly exclude `.git`, `node_modules`, `notebooks/`, `tests/`, `__pycache__`, `.venv`, and output files to keep the Docker image lean | FND 01 | 0.25h | repo status stays clean after local run and build, and Docker build excludes non-runtime files |
- | FND 06 | E01.2 | Person D | `README.md` | Add temporary project stub with title, mission, team roles, and local setup placeholder | FND 01 | 0.5h | new contributor can understand repo purpose in under two minutes |
- | FND 07 | E01.2 | Person C | repo settings | Define branch naming, PR template, and issue template | FND 01 | 0.5h | all future PRs auto show the template and issue fields |
- | FND 08 | E01.2 | Person A and B | docs or backlog file | Freeze JSON contract for actions and observations | FND 04 | 0.75h | all owners sign off and no blocking contract ambiguity remains |
- | FND 09 | E01.2 | Person A | `openenv.yaml` | Create OpenEnv configuration file specifying environment class, action and observation types, and server settings | FND 04 | 0.5h | OpenEnv can discover and serve the environment using this config file |
- | FND 10 | E01.1 | Person C | `replicalab/outputs/` | Create output directory structure with `logs/`, `replays/`, and `plots/` subdirectories and add to gitignore | FND 01 | 0.25h | output directories exist and generated files are not committed to git |
- | FND 11 | E01.1 | Person C | `server/requirements.txt` | Create server requirements file pinning FastAPI, uvicorn, websockets, and other runtime dependencies | FND 02 | 0.25h | server can be installed from requirements.txt independently of pyproject.toml |
- | FND 12 | E01.1 | Person C | `frontend/vite.config.ts` | Create Vite config with API and WebSocket proxy support for local development plus stable build output settings | FND 03 | 0.5h | frontend dev server can reach backend without manual URL edits and build output is predictable for Docker packaging |
- | FND 13 | E01.1 | Person D | `frontend/tailwind.config.ts` and `frontend/postcss.config.js` | Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles | FND 03 | 0.75h | frontend can use Tailwind utilities and shared shadcn compatible theme tokens without CSS pipeline errors |

 ---

@@ -233,20 +259,20 @@ As the training loop, I need deterministic state serialization so episodes can b

 ### Tasks

- | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
- | --- | --- | --- | --- | --- | --- | --- | --- |
- | MOD 01 | E02.1 | Person A | `replicalab/models.py` | Implement `ScientistAction` schema | FND 08 | 0.5h | valid scientist actions parse and invalid fields raise validation errors |
- | MOD 02 | E02.1 | Person A | `replicalab/models.py` | Implement `LabManagerAction` schema | FND 08 | 0.5h | valid lab manager actions parse and invalid fields raise validation errors |
- | MOD 03 | E02.1 | Person A | `replicalab/models.py` | Implement role specific `Observation` models | FND 08 | 0.75h | scientist and lab observations serialize to JSON with stable keys |
- | MOD 04 | E02.2 | Person A | `replicalab/models.py` | Implement `EpisodeState` and `EpisodeLog` models | MOD 03 | 0.75h | full state round trip serialize plus deserialize works |
- | MOD 05 | E02.1 | Person A | `replicalab/utils/validation.py` | Add protocol validation for sample size, controls, duration, equipment vocab, reagent vocab | MOD 01 | 1h | invalid protocol examples are rejected with readable reasons |
- | MOD 06 | E02.1 | Person A | `replicalab/utils/validation.py` | Add semantic validators for impossible plans such as zero sample size with positive controls | MOD 05 | 0.75h | semantic validator catches at least five invalid edge cases |
- | MOD 07 | E02.2 | Person C | `replicalab/utils/logging.py` | Add state serialization helper for replay logs | MOD 04 | 0.5h | state logs can be written and loaded without loss |
- | MOD 08 | E02.2 | Person A | tests | Write unit tests for schemas and validators | MOD 01 to MOD 07 | 1h | tests cover valid parse, invalid parse, and replay serialization |
- | MOD 09 | E02.2 | Person B | `replicalab/agents/scientist_policy.py` | Add output parser that maps model text to `ScientistAction` | MOD 01 | 0.75h | parser returns structured action or explicit parse error |
- | MOD 10 | E02.2 | Person C | API docs | Publish schema examples for frontend and notebook clients | MOD 01 to MOD 04 | 0.5h | frontend and notebook can mock against shared sample payloads |
- | MOD 11 | E02.1 | Person A | `replicalab/models.py` | Implement `StepResult` model with observation, reward, done flag, and info dict | MOD 03 | 0.5h | step result serializes cleanly and all consumers agree on its shape |
- | MOD 12 | E02.2 | Person A | `replicalab/config.py` | Create environment configuration module with constants for max rounds, default difficulty, timeout duration, max budget, and round time limit | FND 08 | 0.5h | all modules import config from one place and no magic numbers remain in env or scoring code |

 ---

@@ -265,21 +291,21 @@ As a judge, I want diverse but believable constraints so the environment tests r

 ### Tasks

- | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
- | --- | --- | --- | --- | --- | --- | --- | --- |
- | SCN 01 | E03.1 | Person A | `replicalab/utils/seed.py` | Implement deterministic RNG helper `seed_rng()` in dedicated seed utility module | FND 08 | 0.5h | same seed always yields the same random choices and seed module is importable from scenarios and env |
- | SCN 02 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Define common scenario schema with paper, lab constraints, and hidden rubric sections | MOD 04 | 0.75h | all scenario builders return the same top level structure |
- | SCN 03 | E03.2 | Person A | `replicalab/scenarios/cell_biology.py` | Implement cell biology template with required controls, equipment, and reagent rules | SCN 02 | 1h | generated scenario passes structure and internal consistency tests |
- | SCN 04 | E03.2 | Person A | `replicalab/scenarios/ml_benchmark.py` | Implement ML benchmark template with GPU, time, and baseline constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests |
- | SCN 05 | E03.2 | Person A | `replicalab/scenarios/behavioral_psych.py` | Implement behavioral psychology survey template with participant, budget, and ethics placeholders | SCN 02 | 1h | generated scenario passes structure and internal consistency tests |
- | SCN 06 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement difficulty application for easy, medium, hard | SCN 03 to SCN 05 | 1h | difficulty visibly changes budget or availability in a meaningful way |
- | SCN 07 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement lab constraint generator for budget, time limit, staff, stock, bookings | SCN 02 | 1.25h | no generated scenario contains contradictory constraints |
- | SCN 08 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement minimum viable replication spec per template | SCN 03 to SCN 05 | 1h | hidden rubric clearly marks what is fixed versus flexible |
- | SCN 09 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement `generate_scenario(seed, template, difficulty)` | SCN 01 to SCN 08 | 0.75h | function returns a full scenario with deterministic content |
- | SCN 10 | E03.1 | Person A | tests | Add seeded generation tests and consistency tests | SCN 09 | 1h | same seed plus template returns same scenario and different seeds vary |
- | SCN 11 | E03.2 | Person B | fixtures | Create hand checked golden scenarios for prompt testing | SCN 09 | 0.75h | three fixed scenarios are available for deterministic manual testing |
- | SCN 12 | E03.2 | Person D | docs | Write plain language scenario summaries for UI examples and README | SCN 03 to SCN 05 | 0.5h | each template has a clean one paragraph explanation for judges |
- | SCN 13 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement equipment booking calendar data model with time slot availability, conflict detection, and booking duration | SCN 07 | 1h | constraint generator can produce realistic booking conflicts and the Lab Manager can check calendar availability |

 ---

@@ -298,19 +324,19 @@ As the Lab Manager, I want deterministic feasibility checks so the environment r

 ### Tasks

- | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
- | --- | --- | --- | --- | --- | --- | --- | --- |
- | AGT 01 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Draft system prompt for Scientist role | MOD 01, SCN 11 | 0.75h | prompt clearly explains role, constraints, and JSON output contract |
- | AGT 02 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build observation to prompt formatting helper | AGT 01, MOD 03 | 0.75h | formatted prompt includes paper info, history, and action schema consistently |
- | AGT 03 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 | 0.75h | malformed output triggers at least one controlled retry or explicit failure |
- | AGT 04 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build baseline heuristic Scientist for non trained smoke tests | AGT 02 | 1h | baseline can complete episodes without crashing |
- | AGT 05 | E04.2 | Person A and B | `replicalab/agents/lab_manager_policy.py` | Implement feasibility checker against budget, equipment, reagents, schedule, personnel | SCN 07, MOD 05 | 1.25h | checker returns clear pass or fail per constraint dimension |
- | AGT 06 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Implement alternative suggestion logic such as substitute technique or smaller sample size | AGT 05, SCN 08 | 1h | lab manager can suggest at least one sensible revision when initial plan fails |
- | AGT 07 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Add human readable response templating from feasibility results | AGT 05 | 0.75h | output is stable, readable, and maps cleanly to underlying checks |
- | AGT 08 | E04.1 | Person B | tests | Add prompt formatting and parse tests for Scientist policy | AGT 01 to AGT 04 | 0.75h | tests cover happy path and malformed output handling |
- | AGT 09 | E04.2 | Person A | tests | Add deterministic policy tests for Lab Manager | AGT 05 to AGT 07 | 0.75h | same proposal plus same lab state returns same response every time |
- | AGT 10 | E04.1 | Person B | `replicalab/prompts/` | Write prompt text files for all three roles: `scientist.txt`, `lab_manager.txt`, `judge.txt` | AGT 01, AGT 07, JDG 06 | 0.75h | prompt files exist, are loadable, and match the agreed role behavior |
- | AGT 11 | E04.1 | Person B | docs | Select and document base model for Scientist training with rationale for model size, license, and structured output capability | AGT 01 | 0.5h | decision is recorded and all team members know which model will be fine tuned |

 ---

@@ -329,19 +355,19 @@ As a judge, I need a readable score breakdown so I can understand why the enviro

 ### Tasks

- | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
- | --- | --- | --- | --- | --- | --- | --- | --- |
- | JDG 01 | E05.1 | Person A | `replicalab/scoring/rigor.py` | Implement rigor score for sample size, controls, method, stats, duration | SCN 08 | 1.25h | score is between 0 and 1 and matches rubric examples |
- | JDG 02 | E05.1 | Person A | `replicalab/scoring/feasibility.py` | Implement feasibility score for budget, equipment, reagents, time, staffing | SCN 07, AGT 05 | 1.25h | score is between 0 and 1 and matches lab constraint logic |
- | JDG 03 | E05.1 | Person A | `replicalab/scoring/fidelity.py` | Implement fidelity score for sample ratio, technique match, control completeness | SCN 08 | 1h | score is between 0 and 1 and matches rubric examples |
- | JDG 04 | E05.1 | Person A | `replicalab/scoring/rubric.py` | Implement total reward formula with bonuses and penalties | JDG 01 to JDG 03 | 0.75h | total reward formula matches agreed math and returns consistent output |
- | JDG 05 | E05.2 | Person A | `replicalab/scoring/rubric.py` | Build reward breakdown object with component scores and penalties | JDG 04 | 0.5h | breakdown includes rigor, feasibility, fidelity, bonuses, and penalties |
- | JDG 06 | E05.2 | Person A | `replicalab/agents/judge_policy.py` | Add optional plain English explanation function from reward breakdown | JDG 05 | 0.75h | explanation mirrors rubric and introduces no new hidden logic |
- | JDG 07 | E05.1 | Person C | `replicalab/utils/logging.py` | Log reward breakdown to CSV or JSONL per episode | JDG 05, MOD 07 | 0.5h | reward file contains seed, scenario, score components, total reward, rounds, agreement |
- | JDG 08 | E05.1 | Person A | tests | Add score determinism tests and edge case tests | JDG 01 to JDG 05 | 1h | perfect and broken protocols produce expected relative ordering |
- | JDG 09 | E05.2 | Person D | UI mocks | Create mock score cards and language for frontend | JDG 05 | 0.5h | UI can display score breakdown from mock data |
- | JDG 10 | E05.1 | Person B | notebook support | Expose component metrics for training plots | JDG 05, JDG 07 | 0.5h | notebook can read average rigor, feasibility, fidelity over time |
- | JDG 11 | E05.2 | Person A | `replicalab/scoring/rubric.py` and `replicalab/agents/judge_policy.py` | Add structured final audit payload with `judge_notes`, `verdict`, and top failure reasons derived from the rubric | JDG 05, JDG 06 | 0.75h | final judgement output is deterministic, human readable, and consumable by env, API, logs, and UI |

 ---

@@ -363,19 +389,19 @@ As a judge, I want deterministic replay and cleanup.

 ### Tasks

- | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
- | --- | --- | --- | --- | --- | --- | --- | --- |
- | ENV 01 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Create `ReplicaLabEnv` class skeleton | MOD 04, SCN 09 | 0.5h | environment class imports and instantiates without runtime errors |
- | ENV 02 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Implement `reset(seed, template, difficulty)` | ENV 01, SCN 09 | 1h | reset returns initial observations and a fresh episode state |
- | ENV 03 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Scientist turn application | ENV 02, AGT 05 | 1h | valid Scientist action updates state and history correctly |
- | ENV 04 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Lab Manager response step | ENV 03, AGT 07 | 1h | lab manager response is appended and returned in the next observation |
- | ENV 05 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement accept, timeout, and max round logic | ENV 03, ENV 04 | 0.75h | episode terminates correctly on agreement or round limit |
- | ENV 06 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Integrate reward computation at finalization and optional intermediate score previews | ENV 05, JDG 05 | 1h | final step returns total reward and breakdown info |
- | ENV 07 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `state()` | ENV 02 to ENV 06 | 0.5h | current environment state can be retrieved for debugging and replay |
- | ENV 08 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `close()` cleanup | ENV 01 | 0.25h | close frees any transient resources and does not throw |
- | ENV 09 | E06.3 | Person C | `replicalab/utils/logging.py` | Write episode logs on completion | ENV 06, JDG 07 | 0.5h | completed episodes generate replayable logs automatically |
- | ENV 10 | E06.1 to E06.3 | Person A | tests | Add reset, step, invalid action, timeout, and deterministic replay tests | ENV 02 to ENV 09 | 1.25h | tests pass for seeded reset, valid step, invalid step, and replay consistency |
- | ENV 11 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Attach judge audit payload to final `StepResult`, terminal observations, and replay state | ENV 06, JDG 11 | 0.5h | completed episodes expose audit notes alongside reward breakdown in a stable schema |

 ---

@@ -394,27 +420,27 @@ As the team, we want one click reproducible deployment to HF Spaces.

 ### Tasks

- | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
- | --- | --- | --- | --- | --- | --- | --- | --- |
- | API 01 | E07.1 | Person C | `server/app.py` | Create FastAPI app shell and health endpoint | ENV 01 | 0.5h | `GET /health` returns 200 with simple payload |
- | API 02 | E07.1 | Person C | `server/app.py` | Add `POST /reset` endpoint | ENV 02 | 0.75h | reset endpoint starts a new episode and returns initial observation |
- | API 03 | E07.1 | Person C | `server/app.py` | Add `POST /step` endpoint | ENV 06 | 0.75h | step endpoint accepts valid action and returns step result |
- | API 04 | E07.1 | Person C | `server/app.py` | Add `GET /scenarios` endpoint | SCN 03 to SCN 05 | 0.5h | endpoint lists available scenario families and difficulties |
- | API 05 | E07.1 | Person C | `server/app.py` | Add `GET /replay/{episode_id}` endpoint | ENV 09 | 0.75h | endpoint returns completed log for valid episode id |
- | API 06 | E07.1 | Person C | `server/app.py` | Add WebSocket session handler | ENV 06 | 1.25h | each connection gets isolated environment state and can reset plus step |
- | API 07 | E07.1 | Person C | `server/app.py` | Add idle timeout and graceful disconnect cleanup | API 06, ENV 08 | 0.75h | stale connections close cleanly and environment closes without leak |
- | API 08 | E07.2 | Person C | `server/Dockerfile` | Build Dockerfile with Python app startup on port 7860 | API 01 to API 07 | 0.75h | local Docker run serves app on port 7860 |
- | API 09 | E07.2 | Person C | HF config files | Add Hugging Face Space metadata and deploy instructions | API 08 | 0.5h | Space config is valid for Docker app deployment |
- | API 10 | E07.2 | Person C | deployment docs | Deploy live Space and verify health, reset, and step | API 09 | 1h | live Space responds successfully to health and one end to end episode |
- | API 11 | E07.1 | Person C | tests | Add server endpoint tests and WebSocket smoke test | API 01 to API 07 | 1h | local server tests pass for health, reset, step, invalid payload, and ws connect |
- | API 12 | E07.2 | Person D | docs | Capture deployment screenshots and public link for README | API 10 | 0.25h | README ready screenshots and live link are available |
- | API 13 | E07.1 | Person C | `server/app.py` | Add CORS middleware configuration for frontend origins in dev and production | API 01 | 0.25h | frontend on localhost:5173 and HF Space origin can reach the API without CORS errors |
- | API 14 | E07.1 | Person C | `server/app.py` | Add REST session management so each user gets isolated environment state | API 02, API 03 | 0.75h | two concurrent REST users do not share or corrupt each other's episode state |
- | API 15 | E07.2 | Person C | HF Space repo | Create HF Space README.md with YAML frontmatter specifying `sdk: docker`, `app_port: 7860`, title, and emoji | API 08 | 0.25h | HF Space config is valid and Space launches correctly from the metadata |
- | API 16 | E07.2 | Person C | `server/Dockerfile` | Configure Docker to build frontend and serve static assets from FastAPI in a single container | API 08, UI 10 | 0.75h | single Docker container serves both API and frontend on port 7860 |
- | API 17 | E07.2 | Person C | deployment docs | Document secrets and API key management for Scientist LLM access in deployment and notebook | API 09 | 0.5h | team knows how to set API keys in HF Space secrets, local env, and Colab secrets |
- | API 18 | E07.1 | Person C | `server/app.py` | Include judge audit payload in REST, replay, and WebSocket responses for terminal episodes | API 03, API 05, API 06, ENV 11 | 0.5h | clients receive `judge_notes` and verdict fields without separate log file access |
- | API 19 | E07.2 | Person C | `openenv.yaml` and deployment docs | Expose and verify OpenEnv built in `/web` fallback route locally and on HF Space | FND 09, API 08, API 10 | 0.5h | `/web` is documented, reachable, and able to run a seeded episode when the custom UI is unavailable |

 ---

@@ -433,23 +459,23 @@ As the team, we want a repeatable evaluation workflow for before versus after co

 ### Tasks

- | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
- | --- | --- | --- | --- | --- | --- | --- | --- |
- | TRN 01 | E08.1 | Person B | `notebooks/train_colab.ipynb` | Create notebook skeleton with setup, connect, train, and plot sections | API 10 | 0.5h | notebook has clear runnable sections in the right order |
- | TRN 02 | E08.1 | Person B | notebook | Add package install and model setup cell for Unsloth or HF TRL | TRN 01 | 0.75h | notebook installs dependencies without manual edits beyond secrets |
- | TRN 03 | E08.1 | Person B | notebook or `client.py` | Implement environment client wrapper for reset plus step over WebSocket or REST | API 06 | 1h | notebook can start and finish an episode against local or hosted env |
- | TRN 04 | E08.1 | Person B | notebook | Implement rollout collection loop for Scientist episodes | TRN 03, AGT 01 | 1h | loop collects trajectories, rewards, and done signals |
- | TRN 05 | E08.1 | Person B | notebook | Connect rollouts to GRPO or equivalent trainer | TRN 04 | 1.25h | at least one short training run completes without runtime errors |
- | TRN 06 | E08.1 | Person B | notebook | Log episode reward, rigor, feasibility, fidelity, and rounds used | JDG 10, TRN 04 | 0.75h | notebook stores metrics frame across training episodes |
- | TRN 07 | E08.2 | Person B | notebook | Plot reward curve and component curves with matplotlib | TRN 06 | 0.5h | plotted image shows visible metrics and can be saved to file |
- | TRN 08 | E08.2 | Person B | notebook | Add before versus after evaluation on fixed seeds | SCN 11, TRN 05 | 1h | notebook compares baseline and trained policy on the same scenarios |
- | TRN 09 | E08.2 | Person B | `replicalab/agents/scientist_policy.py` | Add policy loading path for trained adapter or checkpoint | TRN 05 | 0.5h | evaluation can switch between baseline and trained model cleanly |
- | TRN 10 | E08.2 | Person B | docs | Export plot image and sample logs to `outputs/plots` | TRN 07 | 0.25h | plots are saved and versioned for README use |
- | TRN 11 | E08.1 | Person C | infra notes | Document environment URL, secrets, and connection troubleshooting | TRN 03 | 0.25h | any team member can run the notebook using the notes |
- | TRN 12 | E08.2 | Person D | storytelling | Convert evaluation results into two or three clear bullet insights for judges | TRN 08 | 0.5h | README and demo can state what improved in plain English |
- | TRN 13 | E08.1 | Person B | `replicalab/client.py` | Create reusable environment client module with `connect()`, `reset()`, `step()`, `close()` over REST and WebSocket | API 06 | 1h | client module can be imported by notebook and other consumers without duplicating connection logic |
- | TRN 14 | E08.1 | Person B | notebook or docs | Select and document base model for Scientist fine tuning with rationale for size, license, and structured output capability | TRN 01 | 0.5h | base model choice is documented and all team members know which model is being trained |
- | TRN 15 | E08.2 | Person B | notebook | Add agreement rate and invalid action rate aggregation to evaluation outputs and before versus after comparison | TRN 06, TRN 08, OBS 09 | 0.5h | notebook reports reward, rounds, agreement rate, and invalid action rate for baseline and trained runs |

 ---

@@ -468,23 +494,23 @@ As a team, we want a replayable UI for debugging and recording the demo.

 ### Tasks

- | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
- | --- | --- | --- | --- | --- | --- | --- | --- |
- | UI 01 | E09.1 | Person D | `frontend/src/App.tsx` | Create application shell with three panel layout | FND 03 | 0.75h | app renders layout for paper, conversation, and scoring panels |
- | UI 02 | E09.1 | Person D | `frontend/src/components/PaperPanel.tsx` | Build original paper summary panel | SCN 12 | 0.75h | panel displays title, hypothesis, method, key finding, and seed |
- | UI 03 | E09.1 | Person D | `frontend/src/components/ProtocolPanel.tsx` | Build current protocol and diff panel | JDG 09 | 1h | panel highlights current plan fields and updates after each round |
- | UI 04 | E09.1 | Person D | `frontend/src/components/NegotiationLog.tsx` | Build chat style negotiation log | API 03 or API 06 | 1h | scientist and lab manager messages show in correct order with role styling |
- | UI 05 | E09.1 | Person D | `frontend/src/components/ScorePanel.tsx` | Build rigor, feasibility, fidelity, and total score cards | JDG 09 | 0.75h | score cards render component values and penalties clearly |
- | UI 06 | E09.2 | Person D | `frontend/src/components/Controls.tsx` | Build new episode, seed input, scenario selector, and start controls | API 02, API 04 | 0.75h | user can start a chosen scenario with chosen seed from UI |
- | UI 07 | E09.2 | Person D | `frontend/src/lib/api.ts` | Add REST plus WebSocket client helpers | API 02 to API 06 | 0.75h | UI can connect locally and to the hosted Space |
- | UI 08 | E09.2 | Person D | `frontend/src/components/ReplayViewer.tsx` | Build replay viewer from completed episode logs | API 05 | 1h | user can load a past episode and step through rounds |
- | UI 09 | E09.1 | Person D | `frontend/src/components/TrainingResults.tsx` | Add before versus after panel or static result card | TRN 10 | 0.75h | UI can show reward curve image and summary metrics |
- | UI 10 | E09.1 | Person D | frontend styling | Add clean visual styling with Tailwind plus shadcn compatible primitives and responsive spacing | UI 01 to UI 09, FND 13 | 0.75h | UI is presentable on demo screen without layout breaks and styling stack matches the declared toolchain |
- | UI 11 | E09.2 | Person C | integration | Serve frontend with backend or configure proxy during dev | UI 07, API 01 | 0.5h | one command local dev works and deployed app serves UI path |
- | UI 12 | E09.2 | Person D | tests and smoke | Add smoke test checklist for core UI flow | UI 01 to UI 11 | 0.5h | checklist confirms new episode, step, score update, and replay all work |
- | UI 13 | E09.1 | Person D | `frontend/src/components/JudgeAuditPanel.tsx` or `NegotiationLog.tsx` | Render final Judge audit text and verdict at episode end | JDG 11, API 18 | 0.75h | UI shows a clear end of episode audit without hiding the deterministic score breakdown |
- | UI 14 | E09.2 | Person D | `frontend/src/components/ReplayViewer.tsx` | Add replay slider or scrubber so judges can move across rounds quickly | UI 08 | 0.5h | user can scrub to any round without replaying the full episode sequentially |
- | UI 15 | E09.1 | Person D | `frontend/src/components/TrainingResults.tsx` and `Controls.tsx` | Add before versus after training toggle for baseline versus trained views in the demo UI | UI 06, UI 09, TRN 15 | 0.5h | judges can switch between baseline and trained result summaries from the UI |

 ---

@@ -503,17 +529,17 @@ As a judge, I want the same seeded scenario to be replayable.

 ### Tasks

- | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
- | --- | --- | --- | --- | --- | --- | --- | --- |
- | OBS 01 | E10.1 | Person C | `replicalab/utils/logging.py` | Standardize episode log schema for transcript, state snapshots, and scores | ENV 09 | 0.5h | every completed episode log contains the same required fields |
- | OBS 02 | E10.1 | Person C | logging config | Add local log levels and readable console formatting | API 01 | 0.5h | debug logs can be toggled without code edits |
- | OBS 03 | E10.1 | Person C | replay utilities | Add episode id generation and file naming conventions | OBS 01 | 0.25h | logs never overwrite and are easy to locate |
- | OBS 04 | E10.2 | Person A | tests | Add deterministic replay test using seed and action sequence | ENV 10 | 0.75h | replay of same seed and actions matches prior state sequence |
- | OBS 05 | E10.2 | Person D | UI | Surface episode id and replay link in UI | API 05, UI 08 | 0.5h | user can easily capture or revisit a past episode |
- | OBS 06 | E10.1 | Person B | notebook | Log training run metadata including model, seed, scenario set, steps | TRN 06 | 0.5h | notebook exports metadata with each run for reproducibility |
- | OBS 07 | E10.1 | Person C | scripts | Add simple local script to run one episode and dump logs | ENV 06, OBS 01 | 0.5h | one command produces a complete local sample log |
- | OBS 08 | E10.2 | Person D | storytelling | Create static replay screenshots or gifs for README and video | UI 08 | 0.5h | at least two crisp visual assets are ready for docs and demo |
- | OBS 09 | E10.1 | Person C | `replicalab/utils/logging.py` | Extend episode summary schema with `judge_notes`, `agreement`, `invalid_action_count`, and `invalid_action_rate` for replay and evaluation consumers | OBS 01, JDG 11, ENV 11 | 0.5h | every completed episode log contains the audit payload plus demo and evaluation metrics needed by notebook, UI, and README |

 ---

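OBS 01 and OBS 03 pin down a fixed log schema plus non-colliding file names. A minimal sketch of what those conventions could look like; the required field list and the naming pattern here are assumptions, not the frozen schema:

```python
import json
import time
import uuid
from pathlib import Path

# Placeholder field list; the real schema is frozen in OBS 01 / OBS 09.
REQUIRED_FIELDS = {"episode_id", "seed", "scenario", "transcript", "scores", "total_reward"}


def new_episode_id() -> str:
    # Timestamp prefix keeps files sortable; uuid suffix prevents overwrites (OBS 03).
    return f"ep-{time.strftime('%Y%m%d-%H%M%S')}-{uuid.uuid4().hex[:8]}"


def write_episode_log(log: dict, out_dir: Path) -> Path:
    missing = REQUIRED_FIELDS - log.keys()
    if missing:
        raise ValueError(f"episode log missing required fields: {sorted(missing)}")
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{log['episode_id']}.json"
    path.write_text(json.dumps(log, indent=2))
    return path
```

Because the episode id is part of the file name, replay consumers (OBS 04, API 05) can locate a log from the id alone.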
@@ -532,20 +558,20 @@ As a judge, I want the system to work reliably when clicked live.

 ### Tasks

- | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
- | --- | --- | --- | --- | --- | --- | --- | --- |
- | TST 01 | E11.1 | Person A | `tests/test_env.py` | Add reset returns valid observations test | ENV 02 | 0.5h | test confirms both roles receive valid structured observations |
- | TST 02 | E11.1 | Person A | `tests/test_env.py` | Add valid action step test | ENV 03 to ENV 06 | 0.5h | valid action advances round and returns correct shape |
- | TST 03 | E11.1 | Person A | `tests/test_env.py` | Add invalid action handling test | MOD 05, ENV 03 | 0.5h | invalid action yields structured error and environment survives |
- | TST 04 | E11.1 | Person A | `tests/test_reward.py` | Add perfect protocol high reward test | JDG 04 | 0.5h | perfect protocol scores higher than baseline and broken protocol |
- | TST 05 | E11.1 | Person A | `tests/test_reward.py` | Add zero dimension or penalty behavior test | JDG 04 | 0.5h | zero feasibility or timeout lowers reward as expected |
- | TST 06 | E11.1 | Person C | `tests/test_server.py` | Add health plus reset plus step endpoint tests | API 01 to API 03 | 0.75h | API tests pass locally |
- | TST 07 | E11.1 | Person C | `tests/test_server.py` | Add WebSocket connection and invalid payload tests | API 06 | 0.75h | WebSocket errors are graceful and session stays isolated |
- | TST 08 | E11.2 | Person D | manual checklist | Create demo smoke checklist for local and hosted builds | UI 12, API 10 | 0.5h | team can verify full demo in under five minutes |
- | TST 09 | E11.2 | Person B | notebook checklist | Create notebook smoke test for fresh runtime | TRN 12 | 0.5h | training notebook runs from top with minimal edits |
- | TST 10 | E11.2 | all | full run | Execute one integrated test pass before freeze | all prior TST tasks | 1h | environment, UI, Space, and notebook all pass their smoke tests the same day |
- | TST 11 | E11.1 | Person C | `tests/test_server.py` and `tests/test_env.py` | Add contract tests for judge audit payloads and invalid action metrics in terminal responses and replay logs | API 18, OBS 09 | 0.75h | tests confirm terminal payloads and replay files expose audit notes, agreement, and invalid action metrics |
- | TST 12 | E11.2 | Person D | manual checklist | Add fallback `/web` smoke step plus replay slider and before versus after toggle checks to demo checklist | API 19, UI 14, UI 15 | 0.5h | checklist verifies custom UI path and fallback UI path are both demo ready |

 ---

@@ -564,19 +590,19 @@ As the team, we want all submission requirements complete and polished.

 ### Tasks

- | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
- | --- | --- | --- | --- | --- | --- | --- | --- |
- | DOC 01 | E12.1 | Person D | `README.md` | Write hook, problem statement, and one line product summary | FND 06 | 0.75h | README opening clearly explains the replication crisis and ReplicaLab solution |
- | DOC 02 | E12.1 | Person D | `README.md` | Add architecture diagram and environment loop explanation | ENV 06, API 10 | 1h | diagram matches actual code and can be understood in under ten seconds |
- | DOC 03 | E12.1 | Person D | `README.md` | Add setup instructions for local run, Docker, HF Space, and Colab | API 10, TRN 11 | 0.75h | new user can follow setup without asking the team for hidden steps |
- | DOC 04 | E12.1 | Person D | `README.md` | Add results section with reward curve and before versus after comparison | TRN 10, TRN 12 | 0.75h | README includes at least one figure and one concrete improvement statement |
- | DOC 05 | E12.2 | Person D | demo script | Write one minute demo script with time coded scenes | UI 10, TRN 12 | 0.5h | demo script fits within one minute and covers problem, environment, and result |
- | DOC 06 | E12.2 | Person D | demo assets | Capture screen recording clips and narration or captions | DOC 05 | 1h | raw footage covers all key scenes and is visually clear |
- | DOC 07 | E12.2 | Person D | final video | Edit and upload final one minute YouTube demo | DOC 06 | 1h | video is public or unlisted, shareable, and under the time limit |
- | DOC 08 | E12.2 | Person C | repo hygiene | Verify repo is public and all required files are committed | API 10, UI 10, TRN 10 | 0.25h | public repo contains code, notebook, docs, and no secret leakage |
- | DOC 09 | E12.2 | all | submission form prep | Prepare final submission links and partner track selections | DOC 07, DOC 08 | 0.5h | all submission fields have final links and verified accessibility |
- | DOC 10 | E12.2 | all | dry run | Run final three minute pitch plus two minute Q and A rehearsal | DOC 09 | 0.75h | team can explain tracks, reward, architecture, and results confidently |
- | DOC 11 | E12.1 | Person D | `README.md` | Add evaluation summary table for average reward, rounds to agreement, invalid action rate, agreement rate, and note the `/web` fallback route as backup demo path | DOC 03, DOC 04, TRN 15, API 19 | 0.5h | README results and setup sections reflect all promised metrics and clearly document the fallback demo route |

 ---

 | Storytelling | everyone contributes screenshots, gifs, examples |
 | Submission readiness | all four review final demo, notebook, README, repo visibility |

+ ## 4.1 Training compute availability
+
+ 1. The team has access to an H100 GPU for heavier Scientist training and evaluation runs.
+ 2. Person B is the primary owner of that compute for RL tasks, especially `TRN 04` to `TRN 10`, `TRN 13` to `TRN 15`, `OBS 06`, and `TST 09`.
+ 3. The judged artifact remains the Colab notebook, so any H100 run must still have a documented notebook path or reduced scale fallback that can be shown in Colab.
+ 4. Person C supports any environment URL, secret, or infra setup needed so the H100 training run can connect to the same backend contract as the notebook.
+
 ---

 ## 5. Module and function ownership map

 ## 8. Epic backlog

+ ### Status legend
+
+ - `✅ Completed`
+ - `❌ Failed`
+ - `🟡 Partial`
+ - `⬜ Not started`
+ - `Completed by`: fill this in only when the finisher differs from the assigned owner; otherwise use `—`
+
 ---

 ## Epic E01. Foundations and repository setup

 ### Epic goal
 Create a stable shared codebase, contracts, and development workflow so all workstreams can proceed in parallel.

+ ### Current status
+
+ - `FND 01` status: completed on 2026-03-07
+ - `FND 01` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
+ - `FND 10` status: completed on 2026-03-07
+ - `FND 10` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
+ - Completed scope for `FND 01`: created the agreed repo scaffold for `replicalab/`, `server/`, `frontend/`, `notebooks/`, and `tests/`, including the initial `replicalab/*` and `frontend/src/*` subfolders from the planned layout
+ - Completed scope for `FND 10`: created `replicalab/outputs/` with tracked `logs/`, `replays/`, and `plots/` subdirectories
+ - Remaining work now unblocked by `FND 01`: `FND 02`, `FND 03`, `FND 04`, `FND 05`, `FND 06`, `FND 07`
+ - Remaining Epic E01 work still gated by follow-on dependencies: `FND 08`, `FND 09`, `FND 11`, `FND 12`, `FND 13`
+
 ### User stories

 **US E01.1**

 ### Tasks

+ | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | FND 01 | E01.1 | Person C | repo root | Create repo structure and base folders from agreed layout | none | 0.5h | all top level folders exist and repo clones cleanly | ✅ Completed | Person B (Ayush) |
+ | FND 02 | E01.1 | Person C | `pyproject.toml` | Add Python project config and dependencies placeholder | FND 01 | 0.5h | project installs locally without missing package errors for base modules | ⬜ Not started | — |
+ | FND 03 | E01.1 | Person C | `frontend/package.json` | Initialize React plus Vite frontend shell | FND 01 | 0.5h | `npm install` and dev server run successfully | ⬜ Not started | — |
+ | FND 04 | E01.2 | Person A | `replicalab/models.py` | Add empty Pydantic models and shared type names | FND 01 | 0.5h | import paths resolve for all placeholder models | ⬜ Not started | — |
+ | FND 05 | E01.2 | Person C | `.gitignore` and `.dockerignore` | Add ignore rules for Python, Node, logs, notebooks, and build artifacts. `.dockerignore` must explicitly exclude `.git`, `node_modules`, `notebooks/`, `tests/`, `__pycache__`, `.venv`, and output files to keep the Docker image lean | FND 01 | 0.25h | repo status stays clean after local run and build, and Docker build excludes non-runtime files | ⬜ Not started | — |
+ | FND 06 | E01.2 | Person D | `README.md` | Add temporary project stub with title, mission, team roles, and local setup placeholder | FND 01 | 0.5h | new contributor can understand repo purpose in under two minutes | ⬜ Not started | — |
+ | FND 07 | E01.2 | Person C | repo settings | Define branch naming, PR template, and issue template | FND 01 | 0.5h | all future PRs auto show the template and issue fields | ⬜ Not started | — |
+ | FND 08 | E01.2 | Person A and B | docs or backlog file | Freeze JSON contract for actions and observations | FND 04 | 0.75h | all owners sign off and no blocking contract ambiguity remains | ⬜ Not started | — |
+ | FND 09 | E01.2 | Person A | `openenv.yaml` | Create OpenEnv configuration file specifying environment class, action and observation types, and server settings | FND 04 | 0.5h | OpenEnv can discover and serve the environment using this config file | ⬜ Not started | — |
+ | FND 10 | E01.1 | Person C | `replicalab/outputs/` | Create output directory structure with `logs/`, `replays/`, and `plots/` subdirectories and add to gitignore | FND 01 | 0.25h | output directories exist and generated files are not committed to git | ✅ Completed | Person B (Ayush) |
+ | FND 11 | E01.1 | Person C | `server/requirements.txt` | Create server requirements file pinning FastAPI, uvicorn, websockets, and other runtime dependencies | FND 02 | 0.25h | server can be installed from requirements.txt independently of pyproject.toml | ⬜ Not started | — |
+ | FND 12 | E01.1 | Person C | `frontend/vite.config.ts` | Create Vite config with API and WebSocket proxy support for local development plus stable build output settings | FND 03 | 0.5h | frontend dev server can reach backend without manual URL edits and build output is predictable for Docker packaging | ⬜ Not started | — |
+ | FND 13 | E01.1 | Person D | `frontend/tailwind.config.ts` and `frontend/postcss.config.js` | Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles | FND 03 | 0.75h | frontend can use Tailwind utilities and shared shadcn compatible theme tokens without CSS pipeline errors | ⬜ Not started | — |

 ---

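For reference, the completed `FND 01` plus `FND 10` scope amounts to a small scaffold script. A sketch; the folder list is inferred from the module paths used throughout this document and is illustrative, not the canonical layout:

```python
from pathlib import Path

# Inferred from the module paths referenced in the task tables; illustrative only.
FOLDERS = [
    "replicalab/env", "replicalab/agents", "replicalab/scenarios",
    "replicalab/scoring", "replicalab/utils", "replicalab/prompts",
    "replicalab/outputs/logs", "replicalab/outputs/replays", "replicalab/outputs/plots",
    "server", "frontend/src", "notebooks", "tests",
]


def scaffold(root: Path) -> list:
    """Create the repo folder tree under `root` and return the created paths."""
    created = []
    for rel in FOLDERS:
        d = root / rel
        d.mkdir(parents=True, exist_ok=True)
        # .gitkeep lets git track otherwise-empty directories while
        # generated files inside outputs/ stay gitignored (FND 10).
        (d / ".gitkeep").touch()
        created.append(d)
    return created
```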

 ### Tasks

+ | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | MOD 01 | E02.1 | Person A | `replicalab/models.py` | Implement `ScientistAction` schema | FND 08 | 0.5h | valid scientist actions parse and invalid fields raise validation errors | ⬜ Not started | — |
+ | MOD 02 | E02.1 | Person A | `replicalab/models.py` | Implement `LabManagerAction` schema | FND 08 | 0.5h | valid lab manager actions parse and invalid fields raise validation errors | ⬜ Not started | — |
+ | MOD 03 | E02.1 | Person A | `replicalab/models.py` | Implement role specific `Observation` models | FND 08 | 0.75h | scientist and lab observations serialize to JSON with stable keys | ⬜ Not started | — |
+ | MOD 04 | E02.2 | Person A | `replicalab/models.py` | Implement `EpisodeState` and `EpisodeLog` models | MOD 03 | 0.75h | full state round trip serialize plus deserialize works | ⬜ Not started | — |
+ | MOD 05 | E02.1 | Person A | `replicalab/utils/validation.py` | Add protocol validation for sample size, controls, duration, equipment vocab, reagent vocab | MOD 01 | 1h | invalid protocol examples are rejected with readable reasons | ⬜ Not started | — |
+ | MOD 06 | E02.1 | Person A | `replicalab/utils/validation.py` | Add semantic validators for impossible plans such as zero sample size with positive controls | MOD 05 | 0.75h | semantic validator catches at least five invalid edge cases | ⬜ Not started | — |
+ | MOD 07 | E02.2 | Person C | `replicalab/utils/logging.py` | Add state serialization helper for replay logs | MOD 04 | 0.5h | state logs can be written and loaded without loss | ⬜ Not started | — |
+ | MOD 08 | E02.2 | Person A | tests | Write unit tests for schemas and validators | MOD 01 to MOD 07 | 1h | tests cover valid parse, invalid parse, and replay serialization | ⬜ Not started | — |
+ | MOD 09 | E02.2 | Person B | `replicalab/agents/scientist_policy.py` | Add output parser that maps model text to `ScientistAction` | MOD 01 | 0.75h | parser returns structured action or explicit parse error | ⬜ Not started | — |
+ | MOD 10 | E02.2 | Person C | API docs | Publish schema examples for frontend and notebook clients | MOD 01 to MOD 04 | 0.5h | frontend and notebook can mock against shared sample payloads | ⬜ Not started | — |
+ | MOD 11 | E02.1 | Person A | `replicalab/models.py` | Implement `StepResult` model with observation, reward, done flag, and info dict | MOD 03 | 0.5h | step result serializes cleanly and all consumers agree on its shape | ⬜ Not started | — |
+ | MOD 12 | E02.2 | Person A | `replicalab/config.py` | Create environment configuration module with constants for max rounds, default difficulty, timeout duration, max budget, and round time limit | FND 08 | 0.5h | all modules import config from one place and no magic numbers remain in env or scoring code | ⬜ Not started | — |

 ---

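MOD 01 and MOD 05 require that invalid protocols fail with readable reasons. The real task uses Pydantic; the sketch below shows the same contract with stdlib dataclasses so the validation idea stands alone. The vocab set and field names are placeholders:

```python
from dataclasses import dataclass, field

# Placeholder equipment vocabulary; the real vocab comes from the scenario templates.
VALID_EQUIPMENT = {"microscope", "centrifuge", "incubator"}


@dataclass
class ScientistAction:
    action_type: str                          # e.g. "propose_protocol" or "accept"
    sample_size: int = 0
    controls: list = field(default_factory=list)
    equipment: list = field(default_factory=list)

    def __post_init__(self):
        # Collect every violation so the caller sees all reasons at once (MOD 05).
        errors = []
        if self.action_type == "propose_protocol":
            if self.sample_size <= 0:
                errors.append("sample_size must be positive")
            unknown = set(self.equipment) - VALID_EQUIPMENT
            if unknown:
                errors.append(f"unknown equipment: {sorted(unknown)}")
        if errors:
            raise ValueError("; ".join(errors))
```

In Pydantic the same checks would live in `field_validator` or `model_validator` hooks, which is what MOD 01's "invalid fields raise validation errors" acceptance criterion implies.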
 

 ### Tasks

+ | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | SCN 01 | E03.1 | Person A | `replicalab/utils/seed.py` | Implement deterministic RNG helper `seed_rng()` in dedicated seed utility module | FND 08 | 0.5h | same seed always yields the same random choices and seed module is importable from scenarios and env | ⬜ Not started | — |
+ | SCN 02 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Define common scenario schema with paper, lab constraints, and hidden rubric sections | MOD 04 | 0.75h | all scenario builders return the same top level structure | ⬜ Not started | — |
+ | SCN 03 | E03.2 | Person A | `replicalab/scenarios/cell_biology.py` | Implement cell biology template with required controls, equipment, and reagent rules | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | ⬜ Not started | — |
+ | SCN 04 | E03.2 | Person A | `replicalab/scenarios/ml_benchmark.py` | Implement ML benchmark template with GPU, time, and baseline constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | ⬜ Not started | — |
+ | SCN 05 | E03.2 | Person A | `replicalab/scenarios/behavioral_psych.py` | Implement behavioral psychology survey template with participant, budget, and ethics placeholders | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | ⬜ Not started | — |
+ | SCN 06 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement difficulty application for easy, medium, hard | SCN 03 to SCN 05 | 1h | difficulty visibly changes budget or availability in a meaningful way | ⬜ Not started | — |
+ | SCN 07 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement lab constraint generator for budget, time limit, staff, stock, bookings | SCN 02 | 1.25h | no generated scenario contains contradictory constraints | ⬜ Not started | — |
+ | SCN 08 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement minimum viable replication spec per template | SCN 03 to SCN 05 | 1h | hidden rubric clearly marks what is fixed versus flexible | ⬜ Not started | — |
+ | SCN 09 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement `generate_scenario(seed, template, difficulty)` | SCN 01 to SCN 08 | 0.75h | function returns a full scenario with deterministic content | ⬜ Not started | — |
+ | SCN 10 | E03.1 | Person A | tests | Add seeded generation tests and consistency tests | SCN 09 | 1h | same seed plus template returns same scenario and different seeds vary | ⬜ Not started | — |
+ | SCN 11 | E03.2 | Person B | fixtures | Create hand checked golden scenarios for prompt testing | SCN 09 | 0.75h | three fixed scenarios are available for deterministic manual testing | ⬜ Not started | — |
+ | SCN 12 | E03.2 | Person D | docs | Write plain language scenario summaries for UI examples and README | SCN 03 to SCN 05 | 0.5h | each template has a clean one paragraph explanation for judges | ⬜ Not started | — |
+ | SCN 13 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement equipment booking calendar data model with time slot availability, conflict detection, and booking duration | SCN 07 | 1h | constraint generator can produce realistic booking conflicts and the Lab Manager can check calendar availability | ⬜ Not started | — |

 ---


 ### Tasks

+ | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | AGT 01 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Draft system prompt for Scientist role | MOD 01, SCN 11 | 0.75h | prompt clearly explains role, constraints, and JSON output contract | ⬜ Not started | — |
+ | AGT 02 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build observation to prompt formatting helper | AGT 01, MOD 03 | 0.75h | formatted prompt includes paper info, history, and action schema consistently | ⬜ Not started | — |
+ | AGT 03 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 | 0.75h | malformed output triggers at least one controlled retry or explicit failure | ⬜ Not started | — |
+ | AGT 04 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build baseline heuristic Scientist for non trained smoke tests | AGT 02 | 1h | baseline can complete episodes without crashing | ⬜ Not started | — |
+ | AGT 05 | E04.2 | Person A and B | `replicalab/agents/lab_manager_policy.py` | Implement feasibility checker against budget, equipment, reagents, schedule, personnel | SCN 07, MOD 05 | 1.25h | checker returns clear pass or fail per constraint dimension | ⬜ Not started | — |
+ | AGT 06 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Implement alternative suggestion logic such as substitute technique or smaller sample size | AGT 05, SCN 08 | 1h | lab manager can suggest at least one sensible revision when initial plan fails | ⬜ Not started | — |
+ | AGT 07 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Add human readable response templating from feasibility results | AGT 05 | 0.75h | output is stable, readable, and maps cleanly to underlying checks | ⬜ Not started | — |
+ | AGT 08 | E04.1 | Person B | tests | Add prompt formatting and parse tests for Scientist policy | AGT 01 to AGT 04 | 0.75h | tests cover happy path and malformed output handling | ⬜ Not started | — |
+ | AGT 09 | E04.2 | Person A | tests | Add deterministic policy tests for Lab Manager | AGT 05 to AGT 07 | 0.75h | same proposal plus same lab state returns same response every time | ⬜ Not started | — |
+ | AGT 10 | E04.1 | Person B | `replicalab/prompts/` | Write prompt text files for all three roles: `scientist.txt`, `lab_manager.txt`, `judge.txt` | AGT 01, AGT 07, JDG 06 | 0.75h | prompt files exist, are loadable, and match the agreed role behavior | ⬜ Not started | — |
+ | AGT 11 | E04.1 | Person B | docs | Select and document base model for Scientist training with rationale for model size, license, and structured output capability | AGT 01 | 0.5h | decision is recorded and all team members know which model will be fine tuned | ⬜ Not started | — |

 ---


 ### Tasks

+ | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | JDG 01 | E05.1 | Person A | `replicalab/scoring/rigor.py` | Implement rigor score for sample size, controls, method, stats, duration | SCN 08 | 1.25h | score is between 0 and 1 and matches rubric examples | ⬜ Not started | — |
+ | JDG 02 | E05.1 | Person A | `replicalab/scoring/feasibility.py` | Implement feasibility score for budget, equipment, reagents, time, staffing | SCN 07, AGT 05 | 1.25h | score is between 0 and 1 and matches lab constraint logic | ⬜ Not started | — |
+ | JDG 03 | E05.1 | Person A | `replicalab/scoring/fidelity.py` | Implement fidelity score for sample ratio, technique match, control completeness | SCN 08 | 1h | score is between 0 and 1 and matches rubric examples | ⬜ Not started | — |
+ | JDG 04 | E05.1 | Person A | `replicalab/scoring/rubric.py` | Implement total reward formula with bonuses and penalties | JDG 01 to JDG 03 | 0.75h | total reward formula matches agreed math and returns consistent output | ⬜ Not started | — |
+ | JDG 05 | E05.2 | Person A | `replicalab/scoring/rubric.py` | Build reward breakdown object with component scores and penalties | JDG 04 | 0.5h | breakdown includes rigor, feasibility, fidelity, bonuses, and penalties | ⬜ Not started | — |
+ | JDG 06 | E05.2 | Person A | `replicalab/agents/judge_policy.py` | Add optional plain English explanation function from reward breakdown | JDG 05 | 0.75h | explanation mirrors rubric and introduces no new hidden logic | ⬜ Not started | — |
+ | JDG 07 | E05.1 | Person C | `replicalab/utils/logging.py` | Log reward breakdown to CSV or JSONL per episode | JDG 05, MOD 07 | 0.5h | reward file contains seed, scenario, score components, total reward, rounds, agreement | ⬜ Not started | — |
+ | JDG 08 | E05.1 | Person A | tests | Add score determinism tests and edge case tests | JDG 01 to JDG 05 | 1h | perfect and broken protocols produce expected relative ordering | ⬜ Not started | — |
+ | JDG 09 | E05.2 | Person D | UI mocks | Create mock score cards and language for frontend | JDG 05 | 0.5h | UI can display score breakdown from mock data | ⬜ Not started | — |
+ | JDG 10 | E05.1 | Person B | notebook support | Expose component metrics for training plots | JDG 05, JDG 07 | 0.5h | notebook can read average rigor, feasibility, fidelity over time | ⬜ Not started | — |
+ | JDG 11 | E05.2 | Person A | `replicalab/scoring/rubric.py` and `replicalab/agents/judge_policy.py` | Add structured final audit payload with `judge_notes`, `verdict`, and top failure reasons derived from the rubric | JDG 05, JDG 06 | 0.75h | final judgement output is deterministic, human readable, and consumable by env, API, logs, and UI | ⬜ Not started | — |

 ---


 ### Tasks

+ | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | ENV 01 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Create `ReplicaLabEnv` class skeleton | MOD 04, SCN 09 | 0.5h | environment class imports and instantiates without runtime errors | ⬜ Not started | — |
+ | ENV 02 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Implement `reset(seed, template, difficulty)` | ENV 01, SCN 09 | 1h | reset returns initial observations and a fresh episode state | ⬜ Not started | — |
+ | ENV 03 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Scientist turn application | ENV 02, AGT 05 | 1h | valid Scientist action updates state and history correctly | ⬜ Not started | — |
+ | ENV 04 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Lab Manager response step | ENV 03, AGT 07 | 1h | lab manager response is appended and returned in the next observation | ⬜ Not started | — |
+ | ENV 05 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement accept, timeout, and max round logic | ENV 03, ENV 04 | 0.75h | episode terminates correctly on agreement or round limit | ⬜ Not started | — |
+ | ENV 06 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Integrate reward computation at finalization and optional intermediate score previews | ENV 05, JDG 05 | 1h | final step returns total reward and breakdown info | ⬜ Not started | — |
+ | ENV 07 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `state()` | ENV 02 to ENV 06 | 0.5h | current environment state can be retrieved for debugging and replay | ⬜ Not started | — |
+ | ENV 08 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `close()` cleanup | ENV 01 | 0.25h | close frees any transient resources and does not throw | ⬜ Not started | — |
+ | ENV 09 | E06.3 | Person C | `replicalab/utils/logging.py` | Write episode logs on completion | ENV 06, JDG 07 | 0.5h | completed episodes generate replayable logs automatically | ⬜ Not started | — |
+ | ENV 10 | E06.1 to E06.3 | Person A | tests | Add reset, step, invalid action, timeout, and deterministic replay tests | ENV 02 to ENV 09 | 1.25h | tests pass for seeded reset, valid step, invalid step, and replay consistency | ⬜ Not started | — |
+ | ENV 11 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Attach judge audit payload to final `StepResult`, terminal observations, and replay state | ENV 06, JDG 11 | 0.5h | completed episodes expose audit notes alongside reward breakdown in a stable schema | ⬜ Not started | — |

 ---


 ### Tasks

+ | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
424
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
425
+ | API 01 | E07.1 | Person C | `server/app.py` | Create FastAPI app shell and health endpoint | ENV 01 | 0.5h | `GET /health` returns 200 with simple payload | ⬜ Not started | β€” |
426
+ | API 02 | E07.1 | Person C | `server/app.py` | Add `POST /reset` endpoint | ENV 02 | 0.75h | reset endpoint starts a new episode and returns initial observation | ⬜ Not started | β€” |
427
+ | API 03 | E07.1 | Person C | `server/app.py` | Add `POST /step` endpoint | ENV 06 | 0.75h | step endpoint accepts valid action and returns step result | ⬜ Not started | β€” |
428
+ | API 04 | E07.1 | Person C | `server/app.py` | Add `GET /scenarios` endpoint | SCN 03 to SCN 05 | 0.5h | endpoint lists available scenario families and difficulties | ⬜ Not started | β€” |
429
+ | API 05 | E07.1 | Person C | `server/app.py` | Add `GET /replay/{episode_id}` endpoint | ENV 09 | 0.75h | endpoint returns completed log for valid episode id | ⬜ Not started | β€” |
430
+ | API 06 | E07.1 | Person C | `server/app.py` | Add WebSocket session handler | ENV 06 | 1.25h | each connection gets isolated environment state and can reset plus step | ⬜ Not started | β€” |
431
+ | API 07 | E07.1 | Person C | `server/app.py` | Add idle timeout and graceful disconnect cleanup | API 06, ENV 08 | 0.75h | stale connections close cleanly and environment closes without leak | ⬜ Not started | β€” |
432
+ | API 08 | E07.2 | Person C | `server/Dockerfile` | Build Dockerfile with Python app startup on port 7860 | API 01 to API 07 | 0.75h | local Docker run serves app on port 7860 | ⬜ Not started | β€” |
433
+ | API 09 | E07.2 | Person C | HF config files | Add Hugging Face Space metadata and deploy instructions | API 08 | 0.5h | Space config is valid for Docker app deployment | ⬜ Not started | β€” |
434
+ | API 10 | E07.2 | Person C | deployment docs | Deploy live Space and verify health, reset, and step | API 09 | 1h | live Space responds successfully to health and one end to end episode | ⬜ Not started | β€” |
435
+ | API 11 | E07.1 | Person C | tests | Add server endpoint tests and WebSocket smoke test | API 01 to API 07 | 1h | local server tests pass for health, reset, step, invalid payload, and ws connect | ⬜ Not started | β€” |
436
+ | API 12 | E07.2 | Person D | docs | Capture deployment screenshots and public link for README | API 10 | 0.25h | README-ready screenshots and a live link are available | ⬜ Not started | — |
+ | API 13 | E07.1 | Person C | `server/app.py` | Add CORS middleware configuration for frontend origins in dev and production | API 01 | 0.25h | frontend on localhost:5173 and HF Space origin can reach the API without CORS errors | ⬜ Not started | — |
+ | API 14 | E07.1 | Person C | `server/app.py` | Add REST session management so each user gets isolated environment state | API 02, API 03 | 0.75h | two concurrent REST users do not share or corrupt each other's episode state | ⬜ Not started | — |
+ | API 15 | E07.2 | Person C | HF Space repo | Create HF Space README.md with YAML frontmatter specifying `sdk: docker`, `app_port: 7860`, title, and emoji | API 08 | 0.25h | HF Space config is valid and Space launches correctly from the metadata | ⬜ Not started | — |
+ | API 16 | E07.2 | Person C | `server/Dockerfile` | Configure Docker to build frontend and serve static assets from FastAPI in a single container | API 08, UI 10 | 0.75h | single Docker container serves both API and frontend on port 7860 | ⬜ Not started | — |
+ | API 17 | E07.2 | Person C | deployment docs | Document secrets and API key management for Scientist LLM access in deployment and notebook | API 09 | 0.5h | team knows how to set API keys in HF Space secrets, local env, and Colab secrets | ⬜ Not started | — |
+ | API 18 | E07.1 | Person C | `server/app.py` | Include judge audit payload in REST, replay, and WebSocket responses for terminal episodes | API 03, API 05, API 06, ENV 11 | 0.5h | clients receive `judge_notes` and verdict fields without separate log file access | ⬜ Not started | — |
+ | API 19 | E07.2 | Person C | `openenv.yaml` and deployment docs | Expose and verify the OpenEnv built-in `/web` fallback route locally and on HF Space | FND 09, API 08, API 10 | 0.5h | `/web` is documented, reachable, and able to run a seeded episode when the custom UI is unavailable | ⬜ Not started | — |
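API 14 calls for isolated per-session environment state. A minimal sketch of the pattern, using a stand-in `Env` class; the real ReplicaLab environment and the FastAPI route wiring are assumptions here, not the actual API:

```python
import uuid


class Env:
    """Stand-in environment; the real ReplicaLab env lives in replicalab/."""

    def __init__(self) -> None:
        self.round = 0

    def reset(self) -> dict:
        self.round = 0
        return {"round": self.round}

    def step(self) -> dict:
        self.round += 1
        return {"round": self.round}


class SessionManager:
    """Maps opaque session ids to private Env instances so REST callers never share state."""

    def __init__(self) -> None:
        self._envs: dict[str, Env] = {}

    def create(self) -> str:
        sid = uuid.uuid4().hex
        env = Env()
        env.reset()
        self._envs[sid] = env
        return sid

    def step(self, sid: str) -> dict:
        return self._envs[sid].step()


mgr = SessionManager()
a, b = mgr.create(), mgr.create()
mgr.step(a)  # advance session a only
print(mgr.step(a)["round"], mgr.step(b)["round"])  # prints "2 1": sessions stay isolated
```

In `server/app.py` the session id would travel as a path or query parameter; that wiring is omitted here.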
 
 ---
 
 
 ### Tasks
 
+ | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | TRN 01 | E08.1 | Person B | `notebooks/train_colab.ipynb` | Create notebook skeleton with setup, connect, train, and plot sections | API 10 | 0.5h | notebook has clear runnable sections in the right order | ⬜ Not started | — |
+ | TRN 02 | E08.1 | Person B | notebook | Add package install and model setup cell for Unsloth or HF TRL | TRN 01 | 0.75h | notebook installs dependencies without manual edits beyond secrets | ⬜ Not started | — |
+ | TRN 03 | E08.1 | Person B | notebook or `client.py` | Implement environment client wrapper for reset plus step over WebSocket or REST | API 06 | 1h | notebook can start and finish an episode against local or hosted env | ⬜ Not started | — |
+ | TRN 04 | E08.1 | Person B | notebook | Implement rollout collection loop for Scientist episodes | TRN 03, AGT 01 | 1h | loop collects trajectories, rewards, and done signals | ⬜ Not started | — |
+ | TRN 05 | E08.1 | Person B | notebook | Connect rollouts to GRPO or equivalent trainer | TRN 04 | 1.25h | at least one short training run completes without runtime errors | ⬜ Not started | — |
+ | TRN 06 | E08.1 | Person B | notebook | Log episode reward, rigor, feasibility, fidelity, and rounds used | JDG 10, TRN 04 | 0.75h | notebook stores metrics frame across training episodes | ⬜ Not started | — |
+ | TRN 07 | E08.2 | Person B | notebook | Plot reward curve and component curves with matplotlib | TRN 06 | 0.5h | plotted image shows visible metrics and can be saved to file | ⬜ Not started | — |
+ | TRN 08 | E08.2 | Person B | notebook | Add before versus after evaluation on fixed seeds | SCN 11, TRN 05 | 1h | notebook compares baseline and trained policy on the same scenarios | ⬜ Not started | — |
+ | TRN 09 | E08.2 | Person B | `replicalab/agents/scientist_policy.py` | Add policy loading path for trained adapter or checkpoint | TRN 05 | 0.5h | evaluation can switch between baseline and trained model cleanly | ⬜ Not started | — |
+ | TRN 10 | E08.2 | Person B | docs | Export plot image and sample logs to `outputs/plots` | TRN 07 | 0.25h | plots are saved and versioned for README use | ⬜ Not started | — |
+ | TRN 11 | E08.1 | Person C | infra notes | Document environment URL, secrets, and connection troubleshooting | TRN 03 | 0.25h | any team member can run the notebook using the notes | ⬜ Not started | — |
+ | TRN 12 | E08.2 | Person D | storytelling | Convert evaluation results into two or three clear bullet insights for judges | TRN 08 | 0.5h | README and demo can state what improved in plain English | ⬜ Not started | — |
+ | TRN 13 | E08.1 | Person B | `replicalab/client.py` | Create reusable environment client module with `connect()`, `reset()`, `step()`, `close()` over REST and WebSocket | API 06 | 1h | client module can be imported by notebook and other consumers without duplicating connection logic | ⬜ Not started | — |
+ | TRN 14 | E08.1 | Person B | notebook or docs | Select and document base model for Scientist fine tuning with rationale for size, license, and structured output capability | TRN 01 | 0.5h | base model choice is documented and all team members know which model is being trained | ⬜ Not started | — |
+ | TRN 15 | E08.2 | Person B | notebook | Add agreement rate and invalid action rate aggregation to evaluation outputs and before versus after comparison | TRN 06, TRN 08, OBS 09 | 0.5h | notebook reports reward, rounds, agreement rate, and invalid action rate for baseline and trained runs | ⬜ Not started | — |
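TRN 04's rollout collection loop can be sketched independently of any trainer. The stub environment and policy below are illustrative stand-ins under assumed signatures, not the ReplicaLab interfaces:

```python
def collect_rollout(env_reset, env_step, policy, max_rounds=8):
    """Run one episode and return a list of (observation, action, reward) triples."""
    obs = env_reset()
    trajectory = []
    for _ in range(max_rounds):
        action = policy(obs)
        obs, reward, done = env_step(action)
        trajectory.append((obs, action, reward))
        if done:
            break
    return trajectory


# Stub environment: the episode ends as soon as the Scientist proposes "agree".
def fake_reset():
    return {"round": 0}


def fake_step(action):
    done = action == "agree"
    return {"round": 1}, (1.0 if done else 0.0), done


traj = collect_rollout(fake_reset, fake_step, policy=lambda obs: "agree")
print(len(traj), traj[0][2])  # prints "1 1.0": one step, terminal reward
```

A GRPO trainer (TRN 05) would consume batches of such trajectories; only the collection shape is shown here.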
 
 ---
 
 
 ### Tasks
 
+ | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | UI 01 | E09.1 | Person D | `frontend/src/App.tsx` | Create application shell with three panel layout | FND 03 | 0.75h | app renders layout for paper, conversation, and scoring panels | ⬜ Not started | — |
+ | UI 02 | E09.1 | Person D | `frontend/src/components/PaperPanel.tsx` | Build original paper summary panel | SCN 12 | 0.75h | panel displays title, hypothesis, method, key finding, and seed | ⬜ Not started | — |
+ | UI 03 | E09.1 | Person D | `frontend/src/components/ProtocolPanel.tsx` | Build current protocol and diff panel | JDG 09 | 1h | panel highlights current plan fields and updates after each round | ⬜ Not started | — |
+ | UI 04 | E09.1 | Person D | `frontend/src/components/NegotiationLog.tsx` | Build chat style negotiation log | API 03 or API 06 | 1h | scientist and lab manager messages show in correct order with role styling | ⬜ Not started | — |
+ | UI 05 | E09.1 | Person D | `frontend/src/components/ScorePanel.tsx` | Build rigor, feasibility, fidelity, and total score cards | JDG 09 | 0.75h | score cards render component values and penalties clearly | ⬜ Not started | — |
+ | UI 06 | E09.2 | Person D | `frontend/src/components/Controls.tsx` | Build new episode, seed input, scenario selector, and start controls | API 02, API 04 | 0.75h | user can start a chosen scenario with chosen seed from UI | ⬜ Not started | — |
+ | UI 07 | E09.2 | Person D | `frontend/src/lib/api.ts` | Add REST plus WebSocket client helpers | API 02 to API 06 | 0.75h | UI can connect locally and to the hosted Space | ⬜ Not started | — |
+ | UI 08 | E09.2 | Person D | `frontend/src/components/ReplayViewer.tsx` | Build replay viewer from completed episode logs | API 05 | 1h | user can load a past episode and step through rounds | ⬜ Not started | — |
+ | UI 09 | E09.1 | Person D | `frontend/src/components/TrainingResults.tsx` | Add before versus after panel or static result card | TRN 10 | 0.75h | UI can show reward curve image and summary metrics | ⬜ Not started | — |
+ | UI 10 | E09.1 | Person D | frontend styling | Add clean visual styling with Tailwind plus shadcn-compatible primitives and responsive spacing | UI 01 to UI 09, FND 13 | 0.75h | UI is presentable on demo screen without layout breaks and styling stack matches the declared toolchain | ⬜ Not started | — |
+ | UI 11 | E09.2 | Person C | integration | Serve frontend with backend or configure proxy during dev | UI 07, API 01 | 0.5h | one command local dev works and deployed app serves UI path | ⬜ Not started | — |
+ | UI 12 | E09.2 | Person D | tests and smoke | Add smoke test checklist for core UI flow | UI 01 to UI 11 | 0.5h | checklist confirms new episode, step, score update, and replay all work | ⬜ Not started | — |
+ | UI 13 | E09.1 | Person D | `frontend/src/components/JudgeAuditPanel.tsx` or `NegotiationLog.tsx` | Render final Judge audit text and verdict at episode end | JDG 11, API 18 | 0.75h | UI shows a clear end of episode audit without hiding the deterministic score breakdown | ⬜ Not started | — |
+ | UI 14 | E09.2 | Person D | `frontend/src/components/ReplayViewer.tsx` | Add replay slider or scrubber so judges can move across rounds quickly | UI 08 | 0.5h | user can scrub to any round without replaying the full episode sequentially | ⬜ Not started | — |
+ | UI 15 | E09.1 | Person D | `frontend/src/components/TrainingResults.tsx` and `Controls.tsx` | Add before versus after training toggle for baseline versus trained views in the demo UI | UI 06, UI 09, TRN 15 | 0.5h | judges can switch between baseline and trained result summaries from the UI | ⬜ Not started | — |
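The panels above all render slices of one terminal-episode payload. A hypothetical shape, with field names drawn from the tasks (scores, judge notes, messages) but a layout that is an assumption, not the agreed contract:

```python
import json

# Hypothetical terminal-episode payload; the field layout is illustrative only.
payload = {
    "episode_id": "ep-0001",                     # ReplayViewer lookup key
    "paper": {"title": "...", "hypothesis": "...", "seed": 42},  # PaperPanel
    "messages": [                                 # NegotiationLog
        {"role": "scientist", "text": "Proposing protocol v1"},
        {"role": "lab_manager", "text": "Budget too high, revise"},
    ],
    "scores": {"rigor": 0.8, "feasibility": 0.7, "fidelity": 0.9, "total": 0.8},  # ScorePanel
    "judge_notes": "Protocol matches the original method closely.",  # JudgeAuditPanel
}

# Round-trip to confirm the payload is plain JSON the frontend can fetch.
assert json.loads(json.dumps(payload))["scores"]["total"] == 0.8
```

The TypeScript helpers in `frontend/src/lib/api.ts` would type and fetch this same object; only the data shape is sketched here.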
 
 ---
 
 
 ### Tasks
 
+ | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | OBS 01 | E10.1 | Person C | `replicalab/utils/logging.py` | Standardize episode log schema for transcript, state snapshots, and scores | ENV 09 | 0.5h | every completed episode log contains the same required fields | ⬜ Not started | — |
+ | OBS 02 | E10.1 | Person C | logging config | Add local log levels and readable console formatting | API 01 | 0.5h | debug logs can be toggled without code edits | ⬜ Not started | — |
+ | OBS 03 | E10.1 | Person C | replay utilities | Add episode id generation and file naming conventions | OBS 01 | 0.25h | logs never overwrite and are easy to locate | ⬜ Not started | — |
+ | OBS 04 | E10.2 | Person A | tests | Add deterministic replay test using seed and action sequence | ENV 10 | 0.75h | replay of same seed and actions matches prior state sequence | ⬜ Not started | — |
+ | OBS 05 | E10.2 | Person D | UI | Surface episode id and replay link in UI | API 05, UI 08 | 0.5h | user can easily capture or revisit a past episode | ⬜ Not started | — |
+ | OBS 06 | E10.1 | Person B | notebook | Log training run metadata including model, seed, scenario set, steps | TRN 06 | 0.5h | notebook exports metadata with each run for reproducibility | ⬜ Not started | — |
+ | OBS 07 | E10.1 | Person C | scripts | Add simple local script to run one episode and dump logs | ENV 06, OBS 01 | 0.5h | one command produces a complete local sample log | ⬜ Not started | — |
+ | OBS 08 | E10.2 | Person D | storytelling | Create static replay screenshots or gifs for README and video | UI 08 | 0.5h | at least two crisp visual assets are ready for docs and demo | ⬜ Not started | — |
+ | OBS 09 | E10.1 | Person C | `replicalab/utils/logging.py` | Extend episode summary schema with `judge_notes`, `agreement`, `invalid_action_count`, and `invalid_action_rate` for replay and evaluation consumers | OBS 01, JDG 11, ENV 11 | 0.5h | every completed episode log contains the audit payload plus demo and evaluation metrics needed by notebook, UI, and README | ⬜ Not started | — |
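OBS 01, OBS 03, and OBS 09 together imply a fixed summary schema plus collision-free episode ids. A sketch under those assumptions; the exact field set is whatever the team ratifies, not this list:

```python
import time
import uuid


def new_episode_id() -> str:
    """Timestamped id plus random suffix so log files never collide (OBS 03)."""
    return f"{time.strftime('%Y%m%d-%H%M%S')}-{uuid.uuid4().hex[:8]}"


# Candidate required fields, combining OBS 01 and the OBS 09 extensions.
REQUIRED_FIELDS = {
    "episode_id", "seed", "transcript", "scores",
    "judge_notes", "agreement", "invalid_action_count", "invalid_action_rate",
}


def validate_summary(summary: dict) -> list[str]:
    """Return the required fields missing from an episode summary, sorted."""
    return sorted(REQUIRED_FIELDS - summary.keys())


summary = {field: None for field in REQUIRED_FIELDS}
print(validate_summary(summary))  # prints "[]": complete summary
print(validate_summary({"episode_id": new_episode_id()}))  # lists missing fields
```

A check like `validate_summary` could run at episode close so incomplete logs fail fast rather than surfacing later in the notebook or UI.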
 
 ---
 
 
 ### Tasks
 
+ | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | TST 01 | E11.1 | Person A | `tests/test_env.py` | Add reset returns valid observations test | ENV 02 | 0.5h | test confirms both roles receive valid structured observations | ⬜ Not started | — |
+ | TST 02 | E11.1 | Person A | `tests/test_env.py` | Add valid action step test | ENV 03 to ENV 06 | 0.5h | valid action advances round and returns correct shape | ⬜ Not started | — |
+ | TST 03 | E11.1 | Person A | `tests/test_env.py` | Add invalid action handling test | MOD 05, ENV 03 | 0.5h | invalid action yields structured error and environment survives | ⬜ Not started | — |
+ | TST 04 | E11.1 | Person A | `tests/test_reward.py` | Add perfect protocol high reward test | JDG 04 | 0.5h | perfect protocol scores higher than baseline and broken protocol | ⬜ Not started | — |
+ | TST 05 | E11.1 | Person A | `tests/test_reward.py` | Add zero dimension or penalty behavior test | JDG 04 | 0.5h | zero feasibility or timeout lowers reward as expected | ⬜ Not started | — |
+ | TST 06 | E11.1 | Person C | `tests/test_server.py` | Add health plus reset plus step endpoint tests | API 01 to API 03 | 0.75h | API tests pass locally | ⬜ Not started | — |
+ | TST 07 | E11.1 | Person C | `tests/test_server.py` | Add WebSocket connection and invalid payload tests | API 06 | 0.75h | WebSocket errors are graceful and session stays isolated | ⬜ Not started | — |
+ | TST 08 | E11.2 | Person D | manual checklist | Create demo smoke checklist for local and hosted builds | UI 12, API 10 | 0.5h | team can verify full demo in under five minutes | ⬜ Not started | — |
+ | TST 09 | E11.2 | Person B | notebook checklist | Create notebook smoke test for fresh runtime | TRN 12 | 0.5h | training notebook runs from top with minimal edits | ⬜ Not started | — |
+ | TST 10 | E11.2 | all | full run | Execute one integrated test pass before freeze | all prior TST tasks | 1h | environment, UI, Space, and notebook all pass their smoke tests the same day | ⬜ Not started | — |
+ | TST 11 | E11.1 | Person C | `tests/test_server.py` and `tests/test_env.py` | Add contract tests for judge audit payloads and invalid action metrics in terminal responses and replay logs | API 18, OBS 09 | 0.75h | tests confirm terminal payloads and replay files expose audit notes, agreement, and invalid action metrics | ⬜ Not started | — |
+ | TST 12 | E11.2 | Person D | manual checklist | Add fallback `/web` smoke step plus replay slider and before versus after toggle checks to demo checklist | API 19, UI 14, UI 15 | 0.5h | checklist verifies custom UI path and fallback UI path are both demo ready | ⬜ Not started | — |
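The deterministic replay property these tests lean on (OBS 04) reduces to one statement: same seed plus same actions yields the same state sequence. A toy version with an illustrative environment, not the ReplicaLab one:

```python
import random


class SeededEnv:
    """Toy environment whose transitions depend only on the seed and the actions."""

    def __init__(self, seed: int) -> None:
        self.rng = random.Random(seed)  # private RNG, no global state
        self.state = 0

    def step(self, action: int) -> int:
        self.state += action + self.rng.randint(0, 9)
        return self.state


def run(seed: int, actions: list[int]) -> list[int]:
    env = SeededEnv(seed)
    return [env.step(a) for a in actions]


def test_deterministic_replay():
    actions = [1, 2, 3]
    # Replaying the same seed and action sequence must reproduce every state.
    assert run(7, actions) == run(7, actions)


test_deterministic_replay()
```

The real `tests/test_env.py` version would compare a fresh run against a recorded replay log rather than a second live run.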
 
 ---
 
 
 ### Tasks
 
+ | ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | DOC 01 | E12.1 | Person D | `README.md` | Write hook, problem statement, and one line product summary | FND 06 | 0.75h | README opening clearly explains the replication crisis and ReplicaLab solution | ⬜ Not started | — |
+ | DOC 02 | E12.1 | Person D | `README.md` | Add architecture diagram and environment loop explanation | ENV 06, API 10 | 1h | diagram matches actual code and can be understood in under ten seconds | ⬜ Not started | — |
+ | DOC 03 | E12.1 | Person D | `README.md` | Add setup instructions for local run, Docker, HF Space, and Colab | API 10, TRN 11 | 0.75h | new user can follow setup without asking the team for hidden steps | ⬜ Not started | — |
+ | DOC 04 | E12.1 | Person D | `README.md` | Add results section with reward curve and before versus after comparison | TRN 10, TRN 12 | 0.75h | README includes at least one figure and one concrete improvement statement | ⬜ Not started | — |
+ | DOC 05 | E12.2 | Person D | demo script | Write one minute demo script with time coded scenes | UI 10, TRN 12 | 0.5h | demo script fits within one minute and covers problem, environment, and result | ⬜ Not started | — |
+ | DOC 06 | E12.2 | Person D | demo assets | Capture screen recording clips and narration or captions | DOC 05 | 1h | raw footage covers all key scenes and is visually clear | ⬜ Not started | — |
+ | DOC 07 | E12.2 | Person D | final video | Edit and upload final one minute YouTube demo | DOC 06 | 1h | video is public or unlisted, shareable, and under the time limit | ⬜ Not started | — |
+ | DOC 08 | E12.2 | Person C | repo hygiene | Verify repo is public and all required files are committed | API 10, UI 10, TRN 10 | 0.25h | public repo contains code, notebook, docs, and no secret leakage | ⬜ Not started | — |
+ | DOC 09 | E12.2 | all | submission form prep | Prepare final submission links and partner track selections | DOC 07, DOC 08 | 0.5h | all submission fields have final links and verified accessibility | ⬜ Not started | — |
+ | DOC 10 | E12.2 | all | dry run | Run final three minute pitch plus two minute Q and A rehearsal | DOC 09 | 0.75h | team can explain tracks, reward, architecture, and results confidently | ⬜ Not started | — |
+ | DOC 11 | E12.1 | Person D | `README.md` | Add evaluation summary table for average reward, rounds to agreement, invalid action rate, agreement rate, and note the `/web` fallback route as backup demo path | DOC 03, DOC 04, TRN 15, API 19 | 0.5h | README results and setup sections reflect all promised metrics and clearly document the fallback demo route | ⬜ Not started | — |
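DOC 11's evaluation summary table can be produced mechanically from per-episode records. A sketch with hypothetical record fields (`reward`, `rounds`, `agreement`, `invalid_actions`); the real records come from the OBS logging schema:

```python
def summarize(episodes: list[dict]) -> dict:
    """Aggregate the README metrics DOC 11 lists from per-episode records."""
    n = len(episodes)
    total_rounds = sum(e["rounds"] for e in episodes)
    return {
        "avg_reward": sum(e["reward"] for e in episodes) / n,
        "avg_rounds": total_rounds / n,
        # Fraction of episodes ending in agreement (booleans sum as 0/1).
        "agreement_rate": sum(e["agreement"] for e in episodes) / n,
        # Invalid actions per round, pooled across all episodes.
        "invalid_action_rate": sum(e["invalid_actions"] for e in episodes) / total_rounds,
    }


episodes = [
    {"reward": 0.8, "rounds": 4, "agreement": True, "invalid_actions": 0},
    {"reward": 0.4, "rounds": 6, "agreement": False, "invalid_actions": 1},
]
s = summarize(episodes)
print(s)  # avg_reward 0.6, avg_rounds 5.0, agreement_rate 0.5, invalid_action_rate 0.1
```

Running this once for baseline episodes and once for trained episodes yields the two columns of the before versus after table.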
 
 ---
 
frontend/.gitkeep ADDED
@@ -0,0 +1 @@
+

frontend/src/.gitkeep ADDED
@@ -0,0 +1 @@
+

frontend/src/components/.gitkeep ADDED
@@ -0,0 +1 @@
+

frontend/src/pages/.gitkeep ADDED
@@ -0,0 +1 @@
+

notebooks/.gitkeep ADDED
@@ -0,0 +1 @@
+

replicalab/.gitkeep ADDED
@@ -0,0 +1 @@
+

replicalab/agents/.gitkeep ADDED
@@ -0,0 +1 @@
+

replicalab/outputs/.gitkeep ADDED
@@ -0,0 +1 @@
+

replicalab/outputs/logs/.gitkeep ADDED
@@ -0,0 +1 @@
+

replicalab/outputs/plots/.gitkeep ADDED
@@ -0,0 +1 @@
+

replicalab/outputs/replays/.gitkeep ADDED
@@ -0,0 +1 @@
+

replicalab/prompts/.gitkeep ADDED
@@ -0,0 +1 @@
+

replicalab/scenarios/.gitkeep ADDED
@@ -0,0 +1 @@
+

replicalab/scoring/.gitkeep ADDED
@@ -0,0 +1 @@
+

replicalab/utils/.gitkeep ADDED
@@ -0,0 +1 @@
+

server/.gitkeep ADDED
@@ -0,0 +1 @@
+

tests/.gitkeep ADDED
@@ -0,0 +1 @@
+