Complete FND 01 and FND 10, update task division with status tracking
- FND 01: Add repo scaffold with all top-level folders and subfolders
- FND 10: Add replicalab/outputs/ with logs, replays, plots subdirs
- Add status and completed-by columns to all epic task tables
- Add status legend to epic backlog section
- Add Section 4.1 training compute availability (H100)
- Mark FND 01 and FND 10 as completed
- ReplicaLab_Comprehensive_Task_Division.md +202 -176
- frontend/.gitkeep +1 -0
- frontend/src/.gitkeep +1 -0
- frontend/src/components/.gitkeep +1 -0
- frontend/src/pages/.gitkeep +1 -0
- notebooks/.gitkeep +1 -0
- replicalab/.gitkeep +1 -0
- replicalab/agents/.gitkeep +1 -0
- replicalab/outputs/.gitkeep +1 -0
- replicalab/outputs/logs/.gitkeep +1 -0
- replicalab/outputs/plots/.gitkeep +1 -0
- replicalab/outputs/replays/.gitkeep +1 -0
- replicalab/prompts/.gitkeep +1 -0
- replicalab/scenarios/.gitkeep +1 -0
- replicalab/scoring/.gitkeep +1 -0
- replicalab/utils/.gitkeep +1 -0
- server/.gitkeep +1 -0
- tests/.gitkeep +1 -0
ReplicaLab_Comprehensive_Task_Division.md (CHANGED)
@@ -96,6 +96,13 @@ By judging time, the project should demonstrate:
| Storytelling | everyone contributes screenshots, gifs, examples |
| Submission readiness | all four review final demo, notebook, README, repo visibility |

---

## 5. Module and function ownership map
@@ -183,6 +190,14 @@ Every PR must include:

## 8. Epic backlog

---

## Epic E01. Foundations and repository setup
@@ -190,6 +205,17 @@ Every PR must include:

### Epic goal
Create a stable shared codebase, contracts, and development workflow so all workstreams can proceed in parallel.

### User stories

**US E01.1**
@@ -200,21 +226,21 @@ As a team, we want agreed schemas and coding rules so integration risk stays low

### Tasks

-| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| FND 01 | E01.1 | Person C | repo root | Create repo structure and base folders from agreed layout | none | 0.5h | all top level folders exist and repo clones cleanly |
-| FND 02 | E01.1 | Person C | `pyproject.toml` | Add Python project config and dependencies placeholder | FND 01 | 0.5h | project installs locally without missing package errors for base modules |
-| FND 03 | E01.1 | Person C | `frontend/package.json` | Initialize React plus Vite frontend shell | FND 01 | 0.5h | `npm install` and dev server run successfully |
-| FND 04 | E01.2 | Person A | `replicalab/models.py` | Add empty Pydantic models and shared type names | FND 01 | 0.5h | import paths resolve for all placeholder models |
-| FND 05 | E01.2 | Person C | `.gitignore` and `.dockerignore` | Add ignore rules for Python, Node, logs, notebooks, and build artifacts. `.dockerignore` must explicitly exclude `.git`, `node_modules`, `notebooks/`, `tests/`, `__pycache__`, `.venv`, and output files to keep the Docker image lean | FND 01 | 0.25h | repo status stays clean after local run and build, and Docker build excludes non-runtime files |
-| FND 06 | E01.2 | Person D | `README.md` | Add temporary project stub with title, mission, team roles, and local setup placeholder | FND 01 | 0.5h | new contributor can understand repo purpose in under two minutes |
-| FND 07 | E01.2 | Person C | repo settings | Define branch naming, PR template, and issue template | FND 01 | 0.5h | all future PRs auto show the template and issue fields |
-| FND 08 | E01.2 | Person A and B | docs or backlog file | Freeze JSON contract for actions and observations | FND 04 | 0.75h | all owners sign off and no blocking contract ambiguity remains |
-| FND 09 | E01.2 | Person A | `openenv.yaml` | Create OpenEnv configuration file specifying environment class, action and observation types, and server settings | FND 04 | 0.5h | OpenEnv can discover and serve the environment using this config file |
-| FND 10 | E01.1 | Person C | `replicalab/outputs/` | Create output directory structure with `logs/`, `replays/`, and `plots/` subdirectories and add to gitignore | FND 01 | 0.25h | output directories exist and generated files are not committed to git |
-| FND 11 | E01.1 | Person C | `server/requirements.txt` | Create server requirements file pinning FastAPI, uvicorn, websockets, and other runtime dependencies | FND 02 | 0.25h | server can be installed from requirements.txt independently of pyproject.toml |
-| FND 12 | E01.1 | Person C | `frontend/vite.config.ts` | Create Vite config with API and WebSocket proxy support for local development plus stable build output settings | FND 03 | 0.5h | frontend dev server can reach backend without manual URL edits and build output is predictable for Docker packaging |
-| FND 13 | E01.1 | Person D | `frontend/tailwind.config.ts` and `frontend/postcss.config.js` | Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles | FND 03 | 0.75h | frontend can use Tailwind utilities and shared shadcn compatible theme tokens without CSS pipeline errors |

---
@@ -233,20 +259,20 @@ As the training loop, I need deterministic state serialization so episodes can b

### Tasks

-| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| MOD 01 | E02.1 | Person A | `replicalab/models.py` | Implement `ScientistAction` schema | FND 08 | 0.5h | valid scientist actions parse and invalid fields raise validation errors |
-| MOD 02 | E02.1 | Person A | `replicalab/models.py` | Implement `LabManagerAction` schema | FND 08 | 0.5h | valid lab manager actions parse and invalid fields raise validation errors |
-| MOD 03 | E02.1 | Person A | `replicalab/models.py` | Implement role specific `Observation` models | FND 08 | 0.75h | scientist and lab observations serialize to JSON with stable keys |
-| MOD 04 | E02.2 | Person A | `replicalab/models.py` | Implement `EpisodeState` and `EpisodeLog` models | MOD 03 | 0.75h | full state round trip serialize plus deserialize works |
-| MOD 05 | E02.1 | Person A | `replicalab/utils/validation.py` | Add protocol validation for sample size, controls, duration, equipment vocab, reagent vocab | MOD 01 | 1h | invalid protocol examples are rejected with readable reasons |
-| MOD 06 | E02.1 | Person A | `replicalab/utils/validation.py` | Add semantic validators for impossible plans such as zero sample size with positive controls | MOD 05 | 0.75h | semantic validator catches at least five invalid edge cases |
-| MOD 07 | E02.2 | Person C | `replicalab/utils/logging.py` | Add state serialization helper for replay logs | MOD 04 | 0.5h | state logs can be written and loaded without loss |
-| MOD 08 | E02.2 | Person A | tests | Write unit tests for schemas and validators | MOD 01 to MOD 07 | 1h | tests cover valid parse, invalid parse, and replay serialization |
-| MOD 09 | E02.2 | Person B | `replicalab/agents/scientist_policy.py` | Add output parser that maps model text to `ScientistAction` | MOD 01 | 0.75h | parser returns structured action or explicit parse error |
-| MOD 10 | E02.2 | Person C | API docs | Publish schema examples for frontend and notebook clients | MOD 01 to MOD 04 | 0.5h | frontend and notebook can mock against shared sample payloads |
-| MOD 11 | E02.1 | Person A | `replicalab/models.py` | Implement `StepResult` model with observation, reward, done flag, and info dict | MOD 03 | 0.5h | step result serializes cleanly and all consumers agree on its shape |
-| MOD 12 | E02.2 | Person A | `replicalab/config.py` | Create environment configuration module with constants for max rounds, default difficulty, timeout duration, max budget, and round time limit | FND 08 | 0.5h | all modules import config from one place and no magic numbers remain in env or scoring code |

---
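The validation tasks above (MOD 05 and MOD 06) can be sketched as a small function that returns readable rejection reasons. This is a sketch only: field names such as `sample_size` and `duration_days` are assumptions, and it uses stdlib dataclasses rather than the Pydantic models the tasks actually specify.

```python
from dataclasses import dataclass, field

# Placeholder vocab; the real list belongs to the scenario templates.
EQUIPMENT_VOCAB = {"microscope", "centrifuge", "incubator"}

@dataclass
class Protocol:
    # Assumed field names; the real schema lives in replicalab/models.py.
    sample_size: int
    controls: list = field(default_factory=list)
    duration_days: int = 1
    equipment: list = field(default_factory=list)

def validate_protocol(p: Protocol) -> list:
    """Return a list of human readable rejection reasons (empty means valid)."""
    errors = []
    if p.sample_size < 0:
        errors.append("sample_size must be non-negative")
    if p.sample_size == 0 and p.controls:
        # Semantic check (MOD 06): an impossible plan, not just a bad type.
        errors.append("zero sample size cannot have positive controls")
    if p.duration_days <= 0:
        errors.append("duration_days must be positive")
    unknown = [e for e in p.equipment if e not in EQUIPMENT_VOCAB]
    if unknown:
        errors.append(f"unknown equipment: {unknown}")
    return errors
```

Returning a list of reasons rather than raising on the first failure keeps the "readable reasons" acceptance criterion easy to satisfy in the Lab Manager's feedback.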
@@ -265,21 +291,21 @@ As a judge, I want diverse but believable constraints so the environment tests r

### Tasks

-| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| SCN 01 | E03.1 | Person A | `replicalab/utils/seed.py` | Implement deterministic RNG helper `seed_rng()` in dedicated seed utility module | FND 08 | 0.5h | same seed always yields the same random choices and seed module is importable from scenarios and env |
-| SCN 02 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Define common scenario schema with paper, lab constraints, and hidden rubric sections | MOD 04 | 0.75h | all scenario builders return the same top level structure |
-| SCN 03 | E03.2 | Person A | `replicalab/scenarios/cell_biology.py` | Implement cell biology template with required controls, equipment, and reagent rules | SCN 02 | 1h | generated scenario passes structure and internal consistency tests |
-| SCN 04 | E03.2 | Person A | `replicalab/scenarios/ml_benchmark.py` | Implement ML benchmark template with GPU, time, and baseline constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests |
-| SCN 05 | E03.2 | Person A | `replicalab/scenarios/behavioral_psych.py` | Implement behavioral psychology survey template with participant, budget, and ethics placeholders | SCN 02 | 1h | generated scenario passes structure and internal consistency tests |
-| SCN 06 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement difficulty application for easy, medium, hard | SCN 03 to SCN 05 | 1h | difficulty visibly changes budget or availability in a meaningful way |
-| SCN 07 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement lab constraint generator for budget, time limit, staff, stock, bookings | SCN 02 | 1.25h | no generated scenario contains contradictory constraints |
-| SCN 08 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement minimum viable replication spec per template | SCN 03 to SCN 05 | 1h | hidden rubric clearly marks what is fixed versus flexible |
-| SCN 09 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement `generate_scenario(seed, template, difficulty)` | SCN 01 to SCN 08 | 0.75h | function returns a full scenario with deterministic content |
-| SCN 10 | E03.1 | Person A | tests | Add seeded generation tests and consistency tests | SCN 09 | 1h | same seed plus template returns same scenario and different seeds vary |
-| SCN 11 | E03.2 | Person B | fixtures | Create hand checked golden scenarios for prompt testing | SCN 09 | 0.75h | three fixed scenarios are available for deterministic manual testing |
-| SCN 12 | E03.2 | Person D | docs | Write plain language scenario summaries for UI examples and README | SCN 03 to SCN 05 | 0.5h | each template has a clean one paragraph explanation for judges |
-| SCN 13 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement equipment booking calendar data model with time slot availability, conflict detection, and booking duration | SCN 07 | 1h | constraint generator can produce realistic booking conflicts and the Lab Manager can check calendar availability |

---
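The determinism contract behind SCN 01 and SCN 09 can be sketched as follows. The field names inside the returned dict and the budget ranges per difficulty are illustrative assumptions; the real templates add paper, hidden rubric, and booking calendar sections.

```python
import random

def seed_rng(seed: int) -> random.Random:
    """Sketch of SCN 01: a dedicated Random instance so scenarios never
    touch the global RNG and replays stay deterministic."""
    return random.Random(seed)

def generate_scenario(seed: int, template: str, difficulty: str) -> dict:
    """Sketch of SCN 09 under assumed field names and difficulty bands."""
    rng = seed_rng(seed)
    budgets = {"easy": (8000, 12000), "medium": (4000, 8000), "hard": (1500, 4000)}
    lo, hi = budgets[difficulty]
    return {
        "template": template,
        "difficulty": difficulty,
        "lab_constraints": {
            "budget": rng.randint(lo, hi),
            "staff": rng.randint(1, 4),
            "time_limit_days": rng.randint(10, 60),
        },
    }
```

The seeded-generation tests in SCN 10 then reduce to asserting that two calls with the same arguments return equal structures.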
@@ -298,19 +324,19 @@ As the Lab Manager, I want deterministic feasibility checks so the environment r

### Tasks

-| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| AGT 01 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Draft system prompt for Scientist role | MOD 01, SCN 11 | 0.75h | prompt clearly explains role, constraints, and JSON output contract |
-| AGT 02 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build observation to prompt formatting helper | AGT 01, MOD 03 | 0.75h | formatted prompt includes paper info, history, and action schema consistently |
-| AGT 03 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 | 0.75h | malformed output triggers at least one controlled retry or explicit failure |
-| AGT 04 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build baseline heuristic Scientist for non trained smoke tests | AGT 02 | 1h | baseline can complete episodes without crashing |
-| AGT 05 | E04.2 | Person A and B | `replicalab/agents/lab_manager_policy.py` | Implement feasibility checker against budget, equipment, reagents, schedule, personnel | SCN 07, MOD 05 | 1.25h | checker returns clear pass or fail per constraint dimension |
-| AGT 06 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Implement alternative suggestion logic such as substitute technique or smaller sample size | AGT 05, SCN 08 | 1h | lab manager can suggest at least one sensible revision when initial plan fails |
-| AGT 07 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Add human readable response templating from feasibility results | AGT 05 | 0.75h | output is stable, readable, and maps cleanly to underlying checks |
-| AGT 08 | E04.1 | Person B | tests | Add prompt formatting and parse tests for Scientist policy | AGT 01 to AGT 04 | 0.75h | tests cover happy path and malformed output handling |
-| AGT 09 | E04.2 | Person A | tests | Add deterministic policy tests for Lab Manager | AGT 05 to AGT 07 | 0.75h | same proposal plus same lab state returns same response every time |
-| AGT 10 | E04.1 | Person B | `replicalab/prompts/` | Write prompt text files for all three roles: `scientist.txt`, `lab_manager.txt`, `judge.txt` | AGT 01, AGT 07, JDG 06 | 0.75h | prompt files exist, are loadable, and match the agreed role behavior |
-| AGT 11 | E04.1 | Person B | docs | Select and document base model for Scientist training with rationale for model size, license, and structured output capability | AGT 01 | 0.5h | decision is recorded and all team members know which model will be fine tuned |

---
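The parse-plus-retry behavior in AGT 03 and MOD 09 might look like the sketch below. The required field `action_type` is an assumption standing in for the frozen JSON contract (FND 08); the real parser returns a typed `ScientistAction` rather than a dict.

```python
import json

def parse_action(text: str) -> dict:
    """Sketch of MOD 09: extract the first JSON object from raw model text.
    Raises ValueError with a readable reason on malformed output."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    data = json.loads(text[start:end + 1])
    if "action_type" not in data:  # assumed required field from the contract
        raise ValueError("missing required field: action_type")
    return data

def parse_with_retry(generate, max_retries: int = 1) -> dict:
    """Sketch of AGT 03: re-sample once on malformed output, then fail
    explicitly instead of silently dropping the turn."""
    last_error = None
    for _ in range(max_retries + 1):
        try:
            return parse_action(generate())
        except (ValueError, json.JSONDecodeError) as exc:
            last_error = exc  # a real policy would fold the error into the re-prompt
    raise ValueError(f"could not parse action after {max_retries + 1} attempts: {last_error}")
```

Keeping the explicit failure path matters for training: an unparseable turn should surface as an invalid-action signal, not a crash.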
@@ -329,19 +355,19 @@ As a judge, I need a readable score breakdown so I can understand why the enviro

### Tasks

-| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| JDG 01 | E05.1 | Person A | `replicalab/scoring/rigor.py` | Implement rigor score for sample size, controls, method, stats, duration | SCN 08 | 1.25h | score is between 0 and 1 and matches rubric examples |
-| JDG 02 | E05.1 | Person A | `replicalab/scoring/feasibility.py` | Implement feasibility score for budget, equipment, reagents, time, staffing | SCN 07, AGT 05 | 1.25h | score is between 0 and 1 and matches lab constraint logic |
-| JDG 03 | E05.1 | Person A | `replicalab/scoring/fidelity.py` | Implement fidelity score for sample ratio, technique match, control completeness | SCN 08 | 1h | score is between 0 and 1 and matches rubric examples |
-| JDG 04 | E05.1 | Person A | `replicalab/scoring/rubric.py` | Implement total reward formula with bonuses and penalties | JDG 01 to JDG 03 | 0.75h | total reward formula matches agreed math and returns consistent output |
-| JDG 05 | E05.2 | Person A | `replicalab/scoring/rubric.py` | Build reward breakdown object with component scores and penalties | JDG 04 | 0.5h | breakdown includes rigor, feasibility, fidelity, bonuses, and penalties |
-| JDG 06 | E05.2 | Person A | `replicalab/agents/judge_policy.py` | Add optional plain English explanation function from reward breakdown | JDG 05 | 0.75h | explanation mirrors rubric and introduces no new hidden logic |
-| JDG 07 | E05.1 | Person C | `replicalab/utils/logging.py` | Log reward breakdown to CSV or JSONL per episode | JDG 05, MOD 07 | 0.5h | reward file contains seed, scenario, score components, total reward, rounds, agreement |
-| JDG 08 | E05.1 | Person A | tests | Add score determinism tests and edge case tests | JDG 01 to JDG 05 | 1h | perfect and broken protocols produce expected relative ordering |
-| JDG 09 | E05.2 | Person D | UI mocks | Create mock score cards and language for frontend | JDG 05 | 0.5h | UI can display score breakdown from mock data |
-| JDG 10 | E05.1 | Person B | notebook support | Expose component metrics for training plots | JDG 05, JDG 07 | 0.5h | notebook can read average rigor, feasibility, fidelity over time |
-| JDG 11 | E05.2 | Person A | `replicalab/scoring/rubric.py` and `replicalab/agents/judge_policy.py` | Add structured final audit payload with `judge_notes`, `verdict`, and top failure reasons derived from the rubric | JDG 05, JDG 06 | 0.75h | final judgement output is deterministic, human readable, and consumable by env, API, logs, and UI |

---
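The shape of JDG 04 and JDG 05 can be sketched as a single function that both composes the total and returns the breakdown object. The equal weighting and the clamp to [0, 1] are assumptions; the real formula is whatever math the team agreed in the rubric.

```python
def total_reward(rigor: float, feasibility: float, fidelity: float,
                 bonuses: float = 0.0, penalties: float = 0.0) -> dict:
    """Sketch of JDG 04 (total formula) and JDG 05 (breakdown object),
    assuming equal component weights."""
    for name, score in [("rigor", rigor), ("feasibility", feasibility),
                        ("fidelity", fidelity)]:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    base = (rigor + feasibility + fidelity) / 3
    total = max(0.0, min(1.0, base + bonuses - penalties))
    return {
        "rigor": rigor, "feasibility": feasibility, "fidelity": fidelity,
        "bonuses": bonuses, "penalties": penalties, "total": total,
    }
```

Returning the breakdown from the same call that computes the total keeps JDG 06's explanation honest: there is no second code path that could drift from the rubric.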
@@ -363,19 +389,19 @@ As a judge, I want deterministic replay and cleanup.

### Tasks

-| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| ENV 01 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Create `ReplicaLabEnv` class skeleton | MOD 04, SCN 09 | 0.5h | environment class imports and instantiates without runtime errors |
-| ENV 02 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Implement `reset(seed, template, difficulty)` | ENV 01, SCN 09 | 1h | reset returns initial observations and a fresh episode state |
-| ENV 03 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Scientist turn application | ENV 02, AGT 05 | 1h | valid Scientist action updates state and history correctly |
-| ENV 04 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Lab Manager response step | ENV 03, AGT 07 | 1h | lab manager response is appended and returned in the next observation |
-| ENV 05 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement accept, timeout, and max round logic | ENV 03, ENV 04 | 0.75h | episode terminates correctly on agreement or round limit |
-| ENV 06 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Integrate reward computation at finalization and optional intermediate score previews | ENV 05, JDG 05 | 1h | final step returns total reward and breakdown info |
-| ENV 07 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `state()` | ENV 02 to ENV 06 | 0.5h | current environment state can be retrieved for debugging and replay |
-| ENV 08 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `close()` cleanup | ENV 01 | 0.25h | close frees any transient resources and does not throw |
-| ENV 09 | E06.3 | Person C | `replicalab/utils/logging.py` | Write episode logs on completion | ENV 06, JDG 07 | 0.5h | completed episodes generate replayable logs automatically |
-| ENV 10 | E06.1 to E06.3 | Person A | tests | Add reset, step, invalid action, timeout, and deterministic replay tests | ENV 02 to ENV 09 | 1.25h | tests pass for seeded reset, valid step, invalid step, and replay consistency |
-| ENV 11 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Attach judge audit payload to final `StepResult`, terminal observations, and replay state | ENV 06, JDG 11 | 0.5h | completed episodes expose audit notes alongside reward breakdown in a stable schema |

---
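The environment surface from ENV 01 to ENV 08 (reset, step, termination, state, close) can be sketched as below. All internals are assumptions: the real class applies the Scientist and Lab Manager turns (ENV 03 and ENV 04) and calls the rubric for the final reward instead of the placeholder used here.

```python
import random

class ReplicaLabEnv:
    """Sketch of the ENV 01-ENV 08 lifecycle; state fields are assumed names."""

    def __init__(self, max_rounds: int = 6):
        self.max_rounds = max_rounds
        self._state = None

    def reset(self, seed: int, template: str = "cell_biology",
              difficulty: str = "easy") -> dict:
        rng = random.Random(seed)  # deterministic per seed (ENV 02)
        self._state = {
            "seed": seed, "template": template, "difficulty": difficulty,
            "round": 0, "history": [], "done": False,
            "budget": rng.randint(2000, 10000),
        }
        return {"observation": {"template": template, "budget": self._state["budget"]}}

    def step(self, action: dict) -> dict:
        if self._state is None or self._state["done"]:
            raise RuntimeError("call reset() before step()")
        self._state["round"] += 1
        self._state["history"].append(action)
        # Terminate on explicit agreement or round limit (ENV 05).
        done = bool(action.get("accept")) or self._state["round"] >= self.max_rounds
        self._state["done"] = done
        reward = 1.0 if action.get("accept") else 0.0  # placeholder for JDG 04
        return {"observation": {"round": self._state["round"]}, "reward": reward,
                "done": done, "info": {}}

    def state(self) -> dict:
        """ENV 07: expose current state for debugging and replay."""
        return dict(self._state) if self._state else {}

    def close(self) -> None:
        """ENV 08: release transient resources; never throws."""
        self._state = None
```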
@@ -394,27 +420,27 @@ As the team, we want one click reproducible deployment to HF Spaces.

### Tasks

-| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| API 01 | E07.1 | Person C | `server/app.py` | Create FastAPI app shell and health endpoint | ENV 01 | 0.5h | `GET /health` returns 200 with simple payload |
-| API 02 | E07.1 | Person C | `server/app.py` | Add `POST /reset` endpoint | ENV 02 | 0.75h | reset endpoint starts a new episode and returns initial observation |
-| API 03 | E07.1 | Person C | `server/app.py` | Add `POST /step` endpoint | ENV 06 | 0.75h | step endpoint accepts valid action and returns step result |
-| API 04 | E07.1 | Person C | `server/app.py` | Add `GET /scenarios` endpoint | SCN 03 to SCN 05 | 0.5h | endpoint lists available scenario families and difficulties |
-| API 05 | E07.1 | Person C | `server/app.py` | Add `GET /replay/{episode_id}` endpoint | ENV 09 | 0.75h | endpoint returns completed log for valid episode id |
-| API 06 | E07.1 | Person C | `server/app.py` | Add WebSocket session handler | ENV 06 | 1.25h | each connection gets isolated environment state and can reset plus step |
-| API 07 | E07.1 | Person C | `server/app.py` | Add idle timeout and graceful disconnect cleanup | API 06, ENV 08 | 0.75h | stale connections close cleanly and environment closes without leak |
-| API 08 | E07.2 | Person C | `server/Dockerfile` | Build Dockerfile with Python app startup on port 7860 | API 01 to API 07 | 0.75h | local Docker run serves app on port 7860 |
-| API 09 | E07.2 | Person C | HF config files | Add Hugging Face Space metadata and deploy instructions | API 08 | 0.5h | Space config is valid for Docker app deployment |
-| API 10 | E07.2 | Person C | deployment docs | Deploy live Space and verify health, reset, and step | API 09 | 1h | live Space responds successfully to health and one end to end episode |
-| API 11 | E07.1 | Person C | tests | Add server endpoint tests and WebSocket smoke test | API 01 to API 07 | 1h | local server tests pass for health, reset, step, invalid payload, and ws connect |
-| API 12 | E07.2 | Person D | docs | Capture deployment screenshots and public link for README | API 10 | 0.25h | README ready screenshots and live link are available |
-| API 13 | E07.1 | Person C | `server/app.py` | Add CORS middleware configuration for frontend origins in dev and production | API 01 | 0.25h | frontend on localhost:5173 and HF Space origin can reach the API without CORS errors |
-| API 14 | E07.1 | Person C | `server/app.py` | Add REST session management so each user gets isolated environment state | API 02, API 03 | 0.75h | two concurrent REST users do not share or corrupt each other's episode state |
-| API 15 | E07.2 | Person C | HF Space repo | Create HF Space README.md with YAML frontmatter specifying `sdk: docker`, `app_port: 7860`, title, and emoji | API 08 | 0.25h | HF Space config is valid and Space launches correctly from the metadata |
-| API 16 | E07.2 | Person C | `server/Dockerfile` | Configure Docker to build frontend and serve static assets from FastAPI in a single container | API 08, UI 10 | 0.75h | single Docker container serves both API and frontend on port 7860 |
-| API 17 | E07.2 | Person C | deployment docs | Document secrets and API key management for Scientist LLM access in deployment and notebook | API 09 | 0.5h | team knows how to set API keys in HF Space secrets, local env, and Colab secrets |
-| API 18 | E07.1 | Person C | `server/app.py` | Include judge audit payload in REST, replay, and WebSocket responses for terminal episodes | API 03, API 05, API 06, ENV 11 | 0.5h | clients receive `judge_notes` and verdict fields without separate log file access |
-| API 19 | E07.2 | Person C | `openenv.yaml` and deployment docs | Expose and verify OpenEnv built in `/web` fallback route locally and on HF Space | FND 09, API 08, API 10 | 0.5h | `/web` is documented, reachable, and able to run a seeded episode when the custom UI is unavailable |

---
@@ -433,23 +459,23 @@ As the team, we want a repeatable evaluation workflow for before versus after co

### Tasks

-| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| TRN 01 | E08.1 | Person B | `notebooks/train_colab.ipynb` | Create notebook skeleton with setup, connect, train, and plot sections | API 10 | 0.5h | notebook has clear runnable sections in the right order |
-| TRN 02 | E08.1 | Person B | notebook | Add package install and model setup cell for Unsloth or HF TRL | TRN 01 | 0.75h | notebook installs dependencies without manual edits beyond secrets |
-| TRN 03 | E08.1 | Person B | notebook or `client.py` | Implement environment client wrapper for reset plus step over WebSocket or REST | API 06 | 1h | notebook can start and finish an episode against local or hosted env |
-| TRN 04 | E08.1 | Person B | notebook | Implement rollout collection loop for Scientist episodes | TRN 03, AGT 01 | 1h | loop collects trajectories, rewards, and done signals |
-| TRN 05 | E08.1 | Person B | notebook | Connect rollouts to GRPO or equivalent trainer | TRN 04 | 1.25h | at least one short training run completes without runtime errors |
-| TRN 06 | E08.1 | Person B | notebook | Log episode reward, rigor, feasibility, fidelity, and rounds used | JDG 10, TRN 04 | 0.75h | notebook stores metrics frame across training episodes |
-| TRN 07 | E08.2 | Person B | notebook | Plot reward curve and component curves with matplotlib | TRN 06 | 0.5h | plotted image shows visible metrics and can be saved to file |
-| TRN 08 | E08.2 | Person B | notebook | Add before versus after evaluation on fixed seeds | SCN 11, TRN 05 | 1h | notebook compares baseline and trained policy on the same scenarios |
-| TRN 09 | E08.2 | Person B | `replicalab/agents/scientist_policy.py` | Add policy loading path for trained adapter or checkpoint | TRN 05 | 0.5h | evaluation can switch between baseline and trained model cleanly |
-| TRN 10 | E08.2 | Person B | docs | Export plot image and sample logs to `outputs/plots` | TRN 07 | 0.25h | plots are saved and versioned for README use |
-| TRN 11 | E08.1 | Person C | infra notes | Document environment URL, secrets, and connection troubleshooting | TRN 03 | 0.25h | any team member can run the notebook using the notes |
-| TRN 12 | E08.2 | Person D | storytelling | Convert evaluation results into two or three clear bullet insights for judges | TRN 08 | 0.5h | README and demo can state what improved in plain English |
-| TRN 13 | E08.1 | Person B | `replicalab/client.py` | Create reusable environment client module with `connect()`, `reset()`, `step()`, `close()` over REST and WebSocket | API 06 | 1h | client module can be imported by notebook and other consumers without duplicating connection logic |
-| TRN 14 | E08.1 | Person B | notebook or docs | Select and document base model for Scientist fine tuning with rationale for size, license, and structured output capability | TRN 01 | 0.5h | base model choice is documented and all team members know which model is being trained |
-| TRN 15 | E08.2 | Person B | notebook | Add agreement rate and invalid action rate aggregation to evaluation outputs and before versus after comparison | TRN 06, TRN 08, OBS 09 | 0.5h | notebook reports reward, rounds, agreement rate, and invalid action rate for baseline and trained runs |

---
@@ -468,23 +494,23 @@ As a team, we want a replayable UI for debugging and recording the demo.

### Tasks

-| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| UI 01 | E09.1 | Person D | `frontend/src/App.tsx` | Create application shell with three panel layout | FND 03 | 0.75h | app renders layout for paper, conversation, and scoring panels |
-| UI 02 | E09.1 | Person D | `frontend/src/components/PaperPanel.tsx` | Build original paper summary panel | SCN 12 | 0.75h | panel displays title, hypothesis, method, key finding, and seed |
-| UI 03 | E09.1 | Person D | `frontend/src/components/ProtocolPanel.tsx` | Build current protocol and diff panel | JDG 09 | 1h | panel highlights current plan fields and updates after each round |
-| UI 04 | E09.1 | Person D | `frontend/src/components/NegotiationLog.tsx` | Build chat style negotiation log | API 03 or API 06 | 1h | scientist and lab manager messages show in correct order with role styling |
-| UI 05 | E09.1 | Person D | `frontend/src/components/ScorePanel.tsx` | Build rigor, feasibility, fidelity, and total score cards | JDG 09 | 0.75h | score cards render component values and penalties clearly |
-| UI 06 | E09.2 | Person D | `frontend/src/components/Controls.tsx` | Build new episode, seed input, scenario selector, and start controls | API 02, API 04 | 0.75h | user can start a chosen scenario with chosen seed from UI |
-| UI 07 | E09.2 | Person D | `frontend/src/lib/api.ts` | Add REST plus WebSocket client helpers | API 02 to API 06 | 0.75h | UI can connect locally and to the hosted Space |
-| UI 08 | E09.2 | Person D | `frontend/src/components/ReplayViewer.tsx` | Build replay viewer from completed episode logs | API 05 | 1h | user can load a past episode and step through rounds |
-| UI 09 | E09.1 | Person D | `frontend/src/components/TrainingResults.tsx` | Add before versus after panel or static result card | TRN 10 | 0.75h | UI can show reward curve image and summary metrics |
-| UI 10 | E09.1 | Person D | frontend styling | Add clean visual styling with Tailwind plus shadcn compatible primitives and responsive spacing | UI 01 to UI 09, FND 13 | 0.75h | UI is presentable on demo screen without layout breaks and styling stack matches the declared toolchain |
-| UI 11 | E09.2 | Person C | integration | Serve frontend with backend or configure proxy during dev | UI 07, API 01 | 0.5h | one command local dev works and deployed app serves UI path |
-| UI 12 | E09.2 | Person D | tests and smoke | Add smoke test checklist for core UI flow | UI 01 to UI 11 | 0.5h | checklist confirms new episode, step, score update, and replay all work |
-| UI 13 | E09.1 | Person D | `frontend/src/components/JudgeAuditPanel.tsx` or `NegotiationLog.tsx` | Render final Judge audit text and verdict at episode end | JDG 11, API 18 | 0.75h | UI shows a clear end of episode audit without hiding the deterministic score breakdown |
-| UI 14 | E09.2 | Person D | `frontend/src/components/ReplayViewer.tsx` | Add replay slider or scrubber so judges can move across rounds quickly | UI 08 | 0.5h | user can scrub to any round without replaying the full episode sequentially |
-| UI 15 | E09.1 | Person D | `frontend/src/components/TrainingResults.tsx` and `Controls.tsx` | Add before versus after training toggle for baseline versus trained views in the demo UI | UI 06, UI 09, TRN 15 | 0.5h | judges can switch between baseline and trained result summaries from the UI |

---
As a judge, I want the same seeded scenario to be replayable.

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OBS 01 | E10.1 | Person C | `replicalab/utils/logging.py` | Standardize episode log schema for transcript, state snapshots, and scores | ENV 09 | 0.5h | every completed episode log contains the same required fields |
| OBS 02 | E10.1 | Person C | logging config | Add local log levels and readable console formatting | API 01 | 0.5h | debug logs can be toggled without code edits |
| OBS 03 | E10.1 | Person C | replay utilities | Add episode id generation and file naming conventions | OBS 01 | 0.25h | logs never overwrite and are easy to locate |
| OBS 04 | E10.2 | Person A | tests | Add deterministic replay test using seed and action sequence | ENV 10 | 0.75h | replay of same seed and actions matches prior state sequence |
| OBS 05 | E10.2 | Person D | UI | Surface episode id and replay link in UI | API 05, UI 08 | 0.5h | user can easily capture or revisit a past episode |
| OBS 06 | E10.1 | Person B | notebook | Log training run metadata including model, seed, scenario set, steps | TRN 06 | 0.5h | notebook exports metadata with each run for reproducibility |
| OBS 07 | E10.1 | Person C | scripts | Add simple local script to run one episode and dump logs | ENV 06, OBS 01 | 0.5h | one command produces a complete local sample log |
| OBS 08 | E10.2 | Person D | storytelling | Create static replay screenshots or gifs for README and video | UI 08 | 0.5h | at least two crisp visual assets are ready for docs and demo |
| OBS 09 | E10.1 | Person C | `replicalab/utils/logging.py` | Extend episode summary schema with `judge_notes`, `agreement`, `invalid_action_count`, and `invalid_action_rate` for replay and evaluation consumers | OBS 01, JDG 11, ENV 11 | 0.5h | every completed episode log contains the audit payload plus demo and evaluation metrics needed by notebook, UI, and README |
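The OBS 01 and OBS 03 tasks above pin down a shared episode log schema plus collision-free file naming. A minimal sketch of what that writer could look like, assuming illustrative field names (`episode_id`, `seed`, `scenario`, `transcript`, `scores`) rather than the frozen schema that `replicalab/utils/logging.py` will eventually define:

```python
import json
import uuid
from pathlib import Path

# Hypothetical required fields for every episode record (OBS 01); the real
# list is fixed once the logging module lands.
REQUIRED_FIELDS = ["episode_id", "seed", "scenario", "transcript", "scores"]

def new_episode_id() -> str:
    """Collision-free id so logs never overwrite each other (OBS 03)."""
    return uuid.uuid4().hex[:12]

def write_episode_log(record: dict, out_dir: str = "replicalab/outputs/logs") -> Path:
    """Validate the record against the schema, then write it as one JSONL line."""
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        raise ValueError(f"episode log missing required fields: {missing}")
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    out_file = path / f"{record['episode_id']}.jsonl"
    with out_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return out_file
```

Validating before writing keeps the "same required fields" acceptance criterion enforceable in code rather than by convention.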
---

As a judge, I want the system to work reliably when clicked live.

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TST 01 | E11.1 | Person A | `tests/test_env.py` | Add reset returns valid observations test | ENV 02 | 0.5h | test confirms both roles receive valid structured observations |
| TST 02 | E11.1 | Person A | `tests/test_env.py` | Add valid action step test | ENV 03 to ENV 06 | 0.5h | valid action advances round and returns correct shape |
| TST 03 | E11.1 | Person A | `tests/test_env.py` | Add invalid action handling test | MOD 05, ENV 03 | 0.5h | invalid action yields structured error and environment survives |
| TST 04 | E11.1 | Person A | `tests/test_reward.py` | Add perfect protocol high reward test | JDG 04 | 0.5h | perfect protocol scores higher than baseline and broken protocol |
| TST 05 | E11.1 | Person A | `tests/test_reward.py` | Add zero dimension or penalty behavior test | JDG 04 | 0.5h | zero feasibility or timeout lowers reward as expected |
| TST 06 | E11.1 | Person C | `tests/test_server.py` | Add health plus reset plus step endpoint tests | API 01 to API 03 | 0.75h | API tests pass locally |
| TST 07 | E11.1 | Person C | `tests/test_server.py` | Add WebSocket connection and invalid payload tests | API 06 | 0.75h | WebSocket errors are graceful and session stays isolated |
| TST 08 | E11.2 | Person D | manual checklist | Create demo smoke checklist for local and hosted builds | UI 12, API 10 | 0.5h | team can verify full demo in under five minutes |
| TST 09 | E11.2 | Person B | notebook checklist | Create notebook smoke test for fresh runtime | TRN 12 | 0.5h | training notebook runs from top with minimal edits |
| TST 10 | E11.2 | all | full run | Execute one integrated test pass before freeze | all prior TST tasks | 1h | environment, UI, Space, and notebook all pass their smoke tests the same day |
| TST 11 | E11.1 | Person C | `tests/test_server.py` and `tests/test_env.py` | Add contract tests for judge audit payloads and invalid action metrics in terminal responses and replay logs | API 18, OBS 09 | 0.75h | tests confirm terminal payloads and replay files expose audit notes, agreement, and invalid action metrics |
| TST 12 | E11.2 | Person D | manual checklist | Add fallback `/web` smoke step plus replay slider and before versus after toggle checks to demo checklist | API 19, UI 14, UI 15 | 0.5h | checklist verifies custom UI path and fallback UI path are both demo ready |
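Several tests above (TST 01 to TST 03, plus the OBS 04 replay test) hinge on one assertion pattern: the same seed and action sequence must reproduce the same state trace. Since `ReplicaLabEnv` is not built yet, this sketch shows the pattern against a toy stand-in; `seed_rng` and `run_episode` are illustrative names, not the real environment API:

```python
import random

def seed_rng(seed: int) -> random.Random:
    # Stand-in for the planned seed helper (SCN 01): one isolated RNG per run.
    return random.Random(seed)

def run_episode(seed: int, actions: list[str]) -> list[int]:
    """Toy stand-in for reset plus step: state depends only on seed and actions."""
    rng = seed_rng(seed)
    state, trace = rng.randrange(1000), []
    for action in actions:
        state = (state * 31 + len(action) + rng.randrange(10)) % 100000
        trace.append(state)
    return trace

def test_deterministic_replay():
    # The TST/OBS 04 contract: identical inputs replay identically,
    # while a different seed produces a diverging trace.
    actions = ["propose", "revise", "accept"]
    assert run_episode(42, actions) == run_episode(42, actions)
    assert run_episode(42, actions) != run_episode(43, actions)
```

The real tests would swap `run_episode` for `env.reset(seed=...)` followed by the recorded action sequence, comparing against the stored replay log.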
---

As the team, we want all submission requirements complete and polished.

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DOC 01 | E12.1 | Person D | `README.md` | Write hook, problem statement, and one line product summary | FND 06 | 0.75h | README opening clearly explains the replication crisis and ReplicaLab solution |
| DOC 02 | E12.1 | Person D | `README.md` | Add architecture diagram and environment loop explanation | ENV 06, API 10 | 1h | diagram matches actual code and can be understood in under ten seconds |
| DOC 03 | E12.1 | Person D | `README.md` | Add setup instructions for local run, Docker, HF Space, and Colab | API 10, TRN 11 | 0.75h | new user can follow setup without asking the team for hidden steps |
| DOC 04 | E12.1 | Person D | `README.md` | Add results section with reward curve and before versus after comparison | TRN 10, TRN 12 | 0.75h | README includes at least one figure and one concrete improvement statement |
| DOC 05 | E12.2 | Person D | demo script | Write one minute demo script with time coded scenes | UI 10, TRN 12 | 0.5h | demo script fits within one minute and covers problem, environment, and result |
| DOC 06 | E12.2 | Person D | demo assets | Capture screen recording clips and narration or captions | DOC 05 | 1h | raw footage covers all key scenes and is visually clear |
| DOC 07 | E12.2 | Person D | final video | Edit and upload final one minute YouTube demo | DOC 06 | 1h | video is public or unlisted, shareable, and under the time limit |
| DOC 08 | E12.2 | Person C | repo hygiene | Verify repo is public and all required files are committed | API 10, UI 10, TRN 10 | 0.25h | public repo contains code, notebook, docs, and no secret leakage |
| DOC 09 | E12.2 | all | submission form prep | Prepare final submission links and partner track selections | DOC 07, DOC 08 | 0.5h | all submission fields have final links and verified accessibility |
| DOC 10 | E12.2 | all | dry run | Run final three minute pitch plus two minute Q and A rehearsal | DOC 09 | 0.75h | team can explain tracks, reward, architecture, and results confidently |
| DOC 11 | E12.1 | Person D | `README.md` | Add evaluation summary table for average reward, rounds to agreement, invalid action rate, agreement rate, and note the `/web` fallback route as backup demo path | DOC 03, DOC 04, TRN 15, API 19 | 0.5h | README results and setup sections reflect all promised metrics and clearly document the fallback demo route |
---

## 4.1 Training compute availability

1. The team has access to an H100 GPU for heavier Scientist training and evaluation runs.
2. Person B is the primary owner of that compute for RL tasks, especially `TRN 04` to `TRN 10`, `TRN 13` to `TRN 15`, `OBS 06`, and `TST 09`.
3. The judged artifact remains the Colab notebook, so any H100 run must still have a documented notebook path or reduced scale fallback that can be shown in Colab.
4. Person C supports any environment URL, secret, or infra setup needed so the H100 training run can connect to the same backend contract as the notebook.
---
## 8. Epic backlog

### Status legend

- `✅ Completed`
- `❌ Failed`
- `🟡 Partial`
- `⬜ Not started`
- `Completed by`: fill this only when the finisher is different from the assigned owner; otherwise use `—`
---

## Epic E01. Foundations and repository setup

### Epic goal
Create a stable shared codebase, contracts, and development workflow so all workstreams can proceed in parallel.

### Current status

- `FND 01` status: completed on 2026-03-07
- `FND 01` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
- `FND 10` status: completed on 2026-03-07
- `FND 10` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
- Completed scope for `FND 01`: created the agreed repo scaffold for `replicalab/`, `server/`, `frontend/`, `notebooks/`, and `tests/`, including the initial `replicalab/*` and `frontend/src/*` subfolders from the planned layout
- Completed scope for `FND 10`: created `replicalab/outputs/` with tracked `logs/`, `replays/`, and `plots/` subdirectories
- Remaining work now unblocked by `FND 01`: `FND 02`, `FND 03`, `FND 04`, `FND 05`, `FND 06`, `FND 07`
- Remaining Epic E01 work still gated by follow-on dependencies: `FND 08`, `FND 09`, `FND 11`, `FND 12`, `FND 13`

### User stories
**US E01.1**
### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FND 01 | E01.1 | Person C | repo root | Create repo structure and base folders from agreed layout | none | 0.5h | all top level folders exist and repo clones cleanly | ✅ Completed | Person B (Ayush) |
| FND 02 | E01.1 | Person C | `pyproject.toml` | Add Python project config and dependencies placeholder | FND 01 | 0.5h | project installs locally without missing package errors for base modules | ⬜ Not started | — |
| FND 03 | E01.1 | Person C | `frontend/package.json` | Initialize React plus Vite frontend shell | FND 01 | 0.5h | `npm install` and dev server run successfully | ⬜ Not started | — |
| FND 04 | E01.2 | Person A | `replicalab/models.py` | Add empty Pydantic models and shared type names | FND 01 | 0.5h | import paths resolve for all placeholder models | ⬜ Not started | — |
| FND 05 | E01.2 | Person C | `.gitignore` and `.dockerignore` | Add ignore rules for Python, Node, logs, notebooks, and build artifacts. `.dockerignore` must explicitly exclude `.git`, `node_modules`, `notebooks/`, `tests/`, `__pycache__`, `.venv`, and output files to keep the Docker image lean | FND 01 | 0.25h | repo status stays clean after local run and build, and Docker build excludes non-runtime files | ⬜ Not started | — |
| FND 06 | E01.2 | Person D | `README.md` | Add temporary project stub with title, mission, team roles, and local setup placeholder | FND 01 | 0.5h | new contributor can understand repo purpose in under two minutes | ⬜ Not started | — |
| FND 07 | E01.2 | Person C | repo settings | Define branch naming, PR template, and issue template | FND 01 | 0.5h | all future PRs auto show the template and issue fields | ⬜ Not started | — |
| FND 08 | E01.2 | Person A and B | docs or backlog file | Freeze JSON contract for actions and observations | FND 04 | 0.75h | all owners sign off and no blocking contract ambiguity remains | ⬜ Not started | — |
| FND 09 | E01.2 | Person A | `openenv.yaml` | Create OpenEnv configuration file specifying environment class, action and observation types, and server settings | FND 04 | 0.5h | OpenEnv can discover and serve the environment using this config file | ⬜ Not started | — |
| FND 10 | E01.1 | Person C | `replicalab/outputs/` | Create output directory structure with `logs/`, `replays/`, and `plots/` subdirectories and add to gitignore | FND 01 | 0.25h | output directories exist and generated files are not committed to git | ✅ Completed | Person B (Ayush) |
| FND 11 | E01.1 | Person C | `server/requirements.txt` | Create server requirements file pinning FastAPI, uvicorn, websockets, and other runtime dependencies | FND 02 | 0.25h | server can be installed from requirements.txt independently of pyproject.toml | ⬜ Not started | — |
| FND 12 | E01.1 | Person C | `frontend/vite.config.ts` | Create Vite config with API and WebSocket proxy support for local development plus stable build output settings | FND 03 | 0.5h | frontend dev server can reach backend without manual URL edits and build output is predictable for Docker packaging | ⬜ Not started | — |
| FND 13 | E01.1 | Person D | `frontend/tailwind.config.ts` and `frontend/postcss.config.js` | Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles | FND 03 | 0.75h | frontend can use Tailwind utilities and shared shadcn compatible theme tokens without CSS pipeline errors | ⬜ Not started | — |
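The completed `FND 01` and `FND 10` scope can be reproduced with a short script. The folder list below mirrors the `.gitkeep` files committed with this change; tracking empty directories via `.gitkeep` is the convention this commit uses:

```python
from pathlib import Path

# Folder layout from FND 01 and FND 10, matching the .gitkeep files in this
# commit; git cannot track empty dirs, so each gets a .gitkeep placeholder.
SCAFFOLD = [
    "replicalab/agents",
    "replicalab/prompts",
    "replicalab/scenarios",
    "replicalab/scoring",
    "replicalab/utils",
    "replicalab/outputs/logs",
    "replicalab/outputs/replays",
    "replicalab/outputs/plots",
    "server",
    "frontend/src/components",
    "frontend/src/pages",
    "notebooks",
    "tests",
]

def create_scaffold(root: str = ".") -> list[Path]:
    """Create every folder in the layout and drop a .gitkeep so git tracks it."""
    created = []
    for rel in SCAFFOLD:
        d = Path(root) / rel
        d.mkdir(parents=True, exist_ok=True)
        (d / ".gitkeep").touch()
        created.append(d)
    return created
```

Running `create_scaffold()` at the repo root is idempotent, so re-running it after a partial checkout is safe.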
---

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MOD 01 | E02.1 | Person A | `replicalab/models.py` | Implement `ScientistAction` schema | FND 08 | 0.5h | valid scientist actions parse and invalid fields raise validation errors | ⬜ Not started | — |
| MOD 02 | E02.1 | Person A | `replicalab/models.py` | Implement `LabManagerAction` schema | FND 08 | 0.5h | valid lab manager actions parse and invalid fields raise validation errors | ⬜ Not started | — |
| MOD 03 | E02.1 | Person A | `replicalab/models.py` | Implement role specific `Observation` models | FND 08 | 0.75h | scientist and lab observations serialize to JSON with stable keys | ⬜ Not started | — |
| MOD 04 | E02.2 | Person A | `replicalab/models.py` | Implement `EpisodeState` and `EpisodeLog` models | MOD 03 | 0.75h | full state round trip serialize plus deserialize works | ⬜ Not started | — |
| MOD 05 | E02.1 | Person A | `replicalab/utils/validation.py` | Add protocol validation for sample size, controls, duration, equipment vocab, reagent vocab | MOD 01 | 1h | invalid protocol examples are rejected with readable reasons | ⬜ Not started | — |
| MOD 06 | E02.1 | Person A | `replicalab/utils/validation.py` | Add semantic validators for impossible plans such as zero sample size with positive controls | MOD 05 | 0.75h | semantic validator catches at least five invalid edge cases | ⬜ Not started | — |
| MOD 07 | E02.2 | Person C | `replicalab/utils/logging.py` | Add state serialization helper for replay logs | MOD 04 | 0.5h | state logs can be written and loaded without loss | ⬜ Not started | — |
| MOD 08 | E02.2 | Person A | tests | Write unit tests for schemas and validators | MOD 01 to MOD 07 | 1h | tests cover valid parse, invalid parse, and replay serialization | ⬜ Not started | — |
| MOD 09 | E02.2 | Person B | `replicalab/agents/scientist_policy.py` | Add output parser that maps model text to `ScientistAction` | MOD 01 | 0.75h | parser returns structured action or explicit parse error | ⬜ Not started | — |
| MOD 10 | E02.2 | Person C | API docs | Publish schema examples for frontend and notebook clients | MOD 01 to MOD 04 | 0.5h | frontend and notebook can mock against shared sample payloads | ⬜ Not started | — |
| MOD 11 | E02.1 | Person A | `replicalab/models.py` | Implement `StepResult` model with observation, reward, done flag, and info dict | MOD 03 | 0.5h | step result serializes cleanly and all consumers agree on its shape | ⬜ Not started | — |
| MOD 12 | E02.2 | Person A | `replicalab/config.py` | Create environment configuration module with constants for max rounds, default difficulty, timeout duration, max budget, and round time limit | FND 08 | 0.5h | all modules import config from one place and no magic numbers remain in env or scoring code | ⬜ Not started | — |
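MOD 01 and MOD 06 call for an action schema whose invalid fields raise validation errors, including semantic checks like "zero sample size with positive controls". The plan names Pydantic; this sketch uses a stdlib dataclass stand-in so it runs without the dependency, and its field names are illustrative, not the frozen FND 08 contract:

```python
from dataclasses import dataclass, field

# Stdlib stand-in for the planned Pydantic ScientistAction (MOD 01); the real
# model would subclass pydantic.BaseModel with the same checks as validators.
@dataclass
class ScientistAction:
    action_type: str                  # e.g. "propose_protocol", "revise", "accept"
    sample_size: int = 0
    controls: list = field(default_factory=list)
    duration_days: int = 0

    def __post_init__(self):
        if self.action_type not in {"propose_protocol", "revise", "accept"}:
            raise ValueError(f"unknown action_type: {self.action_type}")
        if self.sample_size < 0 or self.duration_days < 0:
            raise ValueError("sample_size and duration_days must be non-negative")
        # MOD 06 style semantic check: an impossible plan is rejected outright.
        if self.controls and self.sample_size == 0:
            raise ValueError("zero sample size with positive controls is impossible")
```

Keeping semantic checks inside the model means every consumer, env, server, and notebook alike, gets the same rejection behavior for free.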
---

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SCN 01 | E03.1 | Person A | `replicalab/utils/seed.py` | Implement deterministic RNG helper `seed_rng()` in dedicated seed utility module | FND 08 | 0.5h | same seed always yields the same random choices and seed module is importable from scenarios and env | ⬜ Not started | — |
| SCN 02 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Define common scenario schema with paper, lab constraints, and hidden rubric sections | MOD 04 | 0.75h | all scenario builders return the same top level structure | ⬜ Not started | — |
| SCN 03 | E03.2 | Person A | `replicalab/scenarios/cell_biology.py` | Implement cell biology template with required controls, equipment, and reagent rules | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | ⬜ Not started | — |
| SCN 04 | E03.2 | Person A | `replicalab/scenarios/ml_benchmark.py` | Implement ML benchmark template with GPU, time, and baseline constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | ⬜ Not started | — |
| SCN 05 | E03.2 | Person A | `replicalab/scenarios/behavioral_psych.py` | Implement behavioral psychology survey template with participant, budget, and ethics placeholders | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | ⬜ Not started | — |
| SCN 06 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement difficulty application for easy, medium, hard | SCN 03 to SCN 05 | 1h | difficulty visibly changes budget or availability in a meaningful way | ⬜ Not started | — |
| SCN 07 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement lab constraint generator for budget, time limit, staff, stock, bookings | SCN 02 | 1.25h | no generated scenario contains contradictory constraints | ⬜ Not started | — |
| SCN 08 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement minimum viable replication spec per template | SCN 03 to SCN 05 | 1h | hidden rubric clearly marks what is fixed versus flexible | ⬜ Not started | — |
| SCN 09 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement `generate_scenario(seed, template, difficulty)` | SCN 01 to SCN 08 | 0.75h | function returns a full scenario with deterministic content | ⬜ Not started | — |
| SCN 10 | E03.1 | Person A | tests | Add seeded generation tests and consistency tests | SCN 09 | 1h | same seed plus template returns same scenario and different seeds vary | ⬜ Not started | — |
| SCN 11 | E03.2 | Person B | fixtures | Create hand checked golden scenarios for prompt testing | SCN 09 | 0.75h | three fixed scenarios are available for deterministic manual testing | ⬜ Not started | — |
| SCN 12 | E03.2 | Person D | docs | Write plain language scenario summaries for UI examples and README | SCN 03 to SCN 05 | 0.5h | each template has a clean one paragraph explanation for judges | ⬜ Not started | — |
| SCN 13 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement equipment booking calendar data model with time slot availability, conflict detection, and booking duration | SCN 07 | 1h | constraint generator can produce realistic booking conflicts and the Lab Manager can check calendar availability | ⬜ Not started | — |
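The SCN 01, SCN 06, and SCN 09 tasks above combine into one contract: `generate_scenario(seed, template, difficulty)` must be fully deterministic, and difficulty must visibly tighten the constraints. A minimal sketch under assumed shapes, where the budget ranges, field names, and rubric contents are illustrative rather than the SCN 02 schema:

```python
import random

def seed_rng(seed: int) -> random.Random:
    """SCN 01 sketch: one deterministic RNG per build; same seed, same stream."""
    return random.Random(seed)

def generate_scenario(seed: int, template: str = "cell_biology",
                      difficulty: str = "easy") -> dict:
    # SCN 09 sketch: the real builder also fills the paper section and the
    # full hidden rubric per SCN 02/SCN 08; numbers here are placeholders.
    rng = seed_rng(seed)
    budget_cap = {"easy": 50000, "medium": 30000, "hard": 15000}[difficulty]
    return {
        "template": template,
        "difficulty": difficulty,
        "lab_constraints": {
            "budget": rng.randrange(budget_cap // 2, budget_cap),
            "time_limit_days": rng.randrange(30, 120),
        },
        "hidden_rubric": {"min_sample_size": rng.randrange(10, 60)},
    }
```

Drawing every random field from a single seeded `Random` instance is what makes the SCN 10 test ("same seed plus template returns same scenario") trivially satisfiable.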
---

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AGT 01 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Draft system prompt for Scientist role | MOD 01, SCN 11 | 0.75h | prompt clearly explains role, constraints, and JSON output contract | ⬜ Not started | — |
| AGT 02 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build observation to prompt formatting helper | AGT 01, MOD 03 | 0.75h | formatted prompt includes paper info, history, and action schema consistently | ⬜ Not started | — |
| AGT 03 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 | 0.75h | malformed output triggers at least one controlled retry or explicit failure | ⬜ Not started | — |
| AGT 04 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build baseline heuristic Scientist for non trained smoke tests | AGT 02 | 1h | baseline can complete episodes without crashing | ⬜ Not started | — |
| AGT 05 | E04.2 | Person A and B | `replicalab/agents/lab_manager_policy.py` | Implement feasibility checker against budget, equipment, reagents, schedule, personnel | SCN 07, MOD 05 | 1.25h | checker returns clear pass or fail per constraint dimension | ⬜ Not started | — |
| AGT 06 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Implement alternative suggestion logic such as substitute technique or smaller sample size | AGT 05, SCN 08 | 1h | lab manager can suggest at least one sensible revision when initial plan fails | ⬜ Not started | — |
| AGT 07 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Add human readable response templating from feasibility results | AGT 05 | 0.75h | output is stable, readable, and maps cleanly to underlying checks | ⬜ Not started | — |
| AGT 08 | E04.1 | Person B | tests | Add prompt formatting and parse tests for Scientist policy | AGT 01 to AGT 04 | 0.75h | tests cover happy path and malformed output handling | ⬜ Not started | — |
| AGT 09 | E04.2 | Person A | tests | Add deterministic policy tests for Lab Manager | AGT 05 to AGT 07 | 0.75h | same proposal plus same lab state returns same response every time | ⬜ Not started | — |
| AGT 10 | E04.1 | Person B | `replicalab/prompts/` | Write prompt text files for all three roles: `scientist.txt`, `lab_manager.txt`, `judge.txt` | AGT 01, AGT 07, JDG 06 | 0.75h | prompt files exist, are loadable, and match the agreed role behavior | ⬜ Not started | — |
| AGT 11 | E04.1 | Person B | docs | Select and document base model for Scientist training with rationale for model size, license, and structured output capability | AGT 01 | 0.5h | decision is recorded and all team members know which model will be fine tuned | ⬜ Not started | — |
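AGT 03 asks for a parse plus retry strategy so a malformed model reply triggers at least one controlled retry before an explicit failure. A sketch, assuming the MOD 09 parser contract is "JSON object with an `action_type` key"; the helper names are hypothetical:

```python
import json

def parse_action(text: str) -> dict:
    """Map raw model text to a structured action dict or raise (MOD 09 contract)."""
    obj = json.loads(text)
    if "action_type" not in obj:
        raise ValueError("model output missing action_type")
    return obj

def parse_with_retry(generate, max_retries: int = 1) -> dict:
    # AGT 03 sketch: `generate` is any zero-argument callable that returns
    # model text; on a malformed reply we retry a bounded number of times,
    # then fail loudly instead of silently degrading the episode.
    last_err = None
    for _ in range(max_retries + 1):
        try:
            return parse_action(generate())
        except (json.JSONDecodeError, ValueError) as err:
            last_err = err
    raise RuntimeError(f"no parsable action after {max_retries + 1} attempts: {last_err}")
```

Bounding retries keeps episode step latency predictable, which matters once the same path serves the live demo UI.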
---

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JDG 01 | E05.1 | Person A | `replicalab/scoring/rigor.py` | Implement rigor score for sample size, controls, method, stats, duration | SCN 08 | 1.25h | score is between 0 and 1 and matches rubric examples | ⬜ Not started | — |
| JDG 02 | E05.1 | Person A | `replicalab/scoring/feasibility.py` | Implement feasibility score for budget, equipment, reagents, time, staffing | SCN 07, AGT 05 | 1.25h | score is between 0 and 1 and matches lab constraint logic | ⬜ Not started | — |
| JDG 03 | E05.1 | Person A | `replicalab/scoring/fidelity.py` | Implement fidelity score for sample ratio, technique match, control completeness | SCN 08 | 1h | score is between 0 and 1 and matches rubric examples | ⬜ Not started | — |
| JDG 04 | E05.1 | Person A | `replicalab/scoring/rubric.py` | Implement total reward formula with bonuses and penalties | JDG 01 to JDG 03 | 0.75h | total reward formula matches agreed math and returns consistent output | ⬜ Not started | — |
| JDG 05 | E05.2 | Person A | `replicalab/scoring/rubric.py` | Build reward breakdown object with component scores and penalties | JDG 04 | 0.5h | breakdown includes rigor, feasibility, fidelity, bonuses, and penalties | ⬜ Not started | — |
| JDG 06 | E05.2 | Person A | `replicalab/agents/judge_policy.py` | Add optional plain English explanation function from reward breakdown | JDG 05 | 0.75h | explanation mirrors rubric and introduces no new hidden logic | ⬜ Not started | — |
| JDG 07 | E05.1 | Person C | `replicalab/utils/logging.py` | Log reward breakdown to CSV or JSONL per episode | JDG 05, MOD 07 | 0.5h | reward file contains seed, scenario, score components, total reward, rounds, agreement | ⬜ Not started | — |
| JDG 08 | E05.1 | Person A | tests | Add score determinism tests and edge case tests | JDG 01 to JDG 05 | 1h | perfect and broken protocols produce expected relative ordering | ⬜ Not started | — |
| JDG 09 | E05.2 | Person D | UI mocks | Create mock score cards and language for frontend | JDG 05 | 0.5h | UI can display score breakdown from mock data | ⬜ Not started | — |
| JDG 10 | E05.1 | Person B | notebook support | Expose component metrics for training plots | JDG 05, JDG 07 | 0.5h | notebook can read average rigor, feasibility, fidelity over time | ⬜ Not started | — |
| JDG 11 | E05.2 | Person A | `replicalab/scoring/rubric.py` and `replicalab/agents/judge_policy.py` | Add structured final audit payload with `judge_notes`, `verdict`, and top failure reasons derived from the rubric | JDG 05, JDG 06 | 0.75h | final judgement output is deterministic, human readable, and consumable by env, API, logs, and UI | ⬜ Not started | — |
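JDG 04 and JDG 05 combine the three component scores into one total plus a breakdown object. The agreed formula is not yet frozen, so this is only a sketch: the weights and the multiplicative feasibility gate (so zero feasibility zeroes the base reward, matching TST 05's expectation) are assumptions, not the rubric:

```python
def total_reward(rigor: float, feasibility: float, fidelity: float,
                 agreement_bonus: float = 0.0, timeout_penalty: float = 0.0) -> dict:
    # Sketch of JDG 04/05; weights 0.4/0.3 and the feasibility gate are
    # placeholder choices until the formula in scoring/rubric.py is agreed.
    for name, score in {"rigor": rigor, "feasibility": feasibility, "fidelity": fidelity}.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    base = (0.4 * rigor + 0.3 * fidelity) * feasibility  # zero feasibility zeroes the base
    total = max(0.0, base + agreement_bonus - timeout_penalty)
    return {
        "rigor": rigor, "feasibility": feasibility, "fidelity": fidelity,
        "bonuses": agreement_bonus, "penalties": timeout_penalty, "total": total,
    }
```

Returning the breakdown dict alongside the total is what lets JDG 06's explanation function and the UI score cards stay a pure view over the same numbers.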
---

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ENV 01 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Create `ReplicaLabEnv` class skeleton | MOD 04, SCN 09 | 0.5h | environment class imports and instantiates without runtime errors | ⬜ Not started | — |
| ENV 02 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Implement `reset(seed, template, difficulty)` | ENV 01, SCN 09 | 1h | reset returns initial observations and a fresh episode state | ⬜ Not started | — |
| ENV 03 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Scientist turn application | ENV 02, AGT 05 | 1h | valid Scientist action updates state and history correctly | ⬜ Not started | — |
| ENV 04 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Lab Manager response step | ENV 03, AGT 07 | 1h | lab manager response is appended and returned in the next observation | ⬜ Not started | — |
| ENV 05 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement accept, timeout, and max round logic | ENV 03, ENV 04 | 0.75h | episode terminates correctly on agreement or round limit | ⬜ Not started | — |
| ENV 06 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Integrate reward computation at finalization and optional intermediate score previews | ENV 05, JDG 05 | 1h | final step returns total reward and breakdown info | ⬜ Not started | — |
| ENV 07 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `state()` | ENV 02 to ENV 06 | 0.5h | current environment state can be retrieved for debugging and replay | ⬜ Not started | — |
| ENV 08 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `close()` cleanup | ENV 01 | 0.25h | close frees any transient resources and does not throw | ⬜ Not started | — |
| ENV 09 | E06.3 | Person C | `replicalab/utils/logging.py` | Write episode logs on completion | ENV 06, JDG 07 | 0.5h | completed episodes generate replayable logs automatically | ⬜ Not started | — |
| ENV 10 | E06.1 to E06.3 | Person A | tests | Add reset, step, invalid action, timeout, and deterministic replay tests | ENV 02 to ENV 09 | 1.25h | tests pass for seeded reset, valid step, invalid step, and replay consistency | ⬜ Not started | — |
| ENV 11 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Attach judge audit payload to final `StepResult`, terminal observations, and replay state | ENV 06, JDG 11 | 0.5h | completed episodes expose audit notes alongside reward breakdown in a stable schema | ⬜ Not started | — |
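The ENV 01 to ENV 05 loop above reduces to a small state machine: reset starts a fresh episode, each step advances a round, and the episode ends on accept or when the round limit is hit. A minimal skeleton of that control flow, where the observation fields and the terminal reward value are illustrative placeholders for the real MOD 11 `StepResult`:

```python
class ReplicaLabEnv:
    """Minimal sketch of the ENV 01-05 loop; the real class lives in
    replicalab/env/replicalab_env.py and uses the frozen FND 08 schemas."""

    def __init__(self, max_rounds: int = 5):
        self.max_rounds = max_rounds
        self.round = 0
        self.done = True          # not usable until reset() is called
        self.history = []

    def reset(self, seed: int = 0) -> dict:
        self.round, self.done, self.history = 0, False, []
        return {"round": 0, "seed": seed, "message": "episode started"}

    def step(self, action: dict) -> dict:
        if self.done:
            raise RuntimeError("call reset() before step()")
        self.round += 1
        self.history.append(action)
        # ENV 05: accept ends the episode; hitting max_rounds is a timeout.
        self.done = action.get("action_type") == "accept" or self.round >= self.max_rounds
        # Placeholder terminal reward; ENV 06 swaps in the real rubric total.
        return {"round": self.round, "done": self.done, "reward": 1.0 if self.done else 0.0}
```

Keeping termination logic in one place makes the ENV 10 timeout and replay tests a matter of counting steps.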
---
|
| 407 |
|
|
|
|
| 420 |
|
| 421 |
### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| API 01 | E07.1 | Person C | `server/app.py` | Create FastAPI app shell and health endpoint | ENV 01 | 0.5h | `GET /health` returns 200 with simple payload | ⬜ Not started | - |
| API 02 | E07.1 | Person C | `server/app.py` | Add `POST /reset` endpoint | ENV 02 | 0.75h | reset endpoint starts a new episode and returns initial observation | ⬜ Not started | - |
| API 03 | E07.1 | Person C | `server/app.py` | Add `POST /step` endpoint | ENV 06 | 0.75h | step endpoint accepts valid action and returns step result | ⬜ Not started | - |
| API 04 | E07.1 | Person C | `server/app.py` | Add `GET /scenarios` endpoint | SCN 03 to SCN 05 | 0.5h | endpoint lists available scenario families and difficulties | ⬜ Not started | - |
| API 05 | E07.1 | Person C | `server/app.py` | Add `GET /replay/{episode_id}` endpoint | ENV 09 | 0.75h | endpoint returns completed log for valid episode id | ⬜ Not started | - |
| API 06 | E07.1 | Person C | `server/app.py` | Add WebSocket session handler | ENV 06 | 1.25h | each connection gets isolated environment state and can reset plus step | ⬜ Not started | - |
| API 07 | E07.1 | Person C | `server/app.py` | Add idle timeout and graceful disconnect cleanup | API 06, ENV 08 | 0.75h | stale connections close cleanly and environment closes without leak | ⬜ Not started | - |
| API 08 | E07.2 | Person C | `server/Dockerfile` | Build Dockerfile with Python app startup on port 7860 | API 01 to API 07 | 0.75h | local Docker run serves app on port 7860 | ⬜ Not started | - |
| API 09 | E07.2 | Person C | HF config files | Add Hugging Face Space metadata and deploy instructions | API 08 | 0.5h | Space config is valid for Docker app deployment | ⬜ Not started | - |
| API 10 | E07.2 | Person C | deployment docs | Deploy live Space and verify health, reset, and step | API 09 | 1h | live Space responds successfully to health and one end to end episode | ⬜ Not started | - |
| API 11 | E07.1 | Person C | tests | Add server endpoint tests and WebSocket smoke test | API 01 to API 07 | 1h | local server tests pass for health, reset, step, invalid payload, and ws connect | ⬜ Not started | - |
| API 12 | E07.2 | Person D | docs | Capture deployment screenshots and public link for README | API 10 | 0.25h | README ready screenshots and live link are available | ⬜ Not started | - |
| API 13 | E07.1 | Person C | `server/app.py` | Add CORS middleware configuration for frontend origins in dev and production | API 01 | 0.25h | frontend on localhost:5173 and HF Space origin can reach the API without CORS errors | ⬜ Not started | - |
| API 14 | E07.1 | Person C | `server/app.py` | Add REST session management so each user gets isolated environment state | API 02, API 03 | 0.75h | two concurrent REST users do not share or corrupt each other's episode state | ⬜ Not started | - |
| API 15 | E07.2 | Person C | HF Space repo | Create HF Space README.md with YAML frontmatter specifying `sdk: docker`, `app_port: 7860`, title, and emoji | API 08 | 0.25h | HF Space config is valid and Space launches correctly from the metadata | ⬜ Not started | - |
| API 16 | E07.2 | Person C | `server/Dockerfile` | Configure Docker to build frontend and serve static assets from FastAPI in a single container | API 08, UI 10 | 0.75h | single Docker container serves both API and frontend on port 7860 | ⬜ Not started | - |
| API 17 | E07.2 | Person C | deployment docs | Document secrets and API key management for Scientist LLM access in deployment and notebook | API 09 | 0.5h | team knows how to set API keys in HF Space secrets, local env, and Colab secrets | ⬜ Not started | - |
| API 18 | E07.1 | Person C | `server/app.py` | Include judge audit payload in REST, replay, and WebSocket responses for terminal episodes | API 03, API 05, API 06, ENV 11 | 0.5h | clients receive `judge_notes` and verdict fields without separate log file access | ⬜ Not started | - |
| API 19 | E07.2 | Person C | `openenv.yaml` and deployment docs | Expose and verify OpenEnv built in `/web` fallback route locally and on HF Space | FND 09, API 08, API 10 | 0.5h | `/web` is documented, reachable, and able to run a seeded episode when the custom UI is unavailable | ⬜ Not started | - |

---

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TRN 01 | E08.1 | Person B | `notebooks/train_colab.ipynb` | Create notebook skeleton with setup, connect, train, and plot sections | API 10 | 0.5h | notebook has clear runnable sections in the right order | ⬜ Not started | - |
| TRN 02 | E08.1 | Person B | notebook | Add package install and model setup cell for Unsloth or HF TRL | TRN 01 | 0.75h | notebook installs dependencies without manual edits beyond secrets | ⬜ Not started | - |
| TRN 03 | E08.1 | Person B | notebook or `client.py` | Implement environment client wrapper for reset plus step over WebSocket or REST | API 06 | 1h | notebook can start and finish an episode against local or hosted env | ⬜ Not started | - |
| TRN 04 | E08.1 | Person B | notebook | Implement rollout collection loop for Scientist episodes | TRN 03, AGT 01 | 1h | loop collects trajectories, rewards, and done signals | ⬜ Not started | - |
| TRN 05 | E08.1 | Person B | notebook | Connect rollouts to GRPO or equivalent trainer | TRN 04 | 1.25h | at least one short training run completes without runtime errors | ⬜ Not started | - |
| TRN 06 | E08.1 | Person B | notebook | Log episode reward, rigor, feasibility, fidelity, and rounds used | JDG 10, TRN 04 | 0.75h | notebook stores metrics frame across training episodes | ⬜ Not started | - |
| TRN 07 | E08.2 | Person B | notebook | Plot reward curve and component curves with matplotlib | TRN 06 | 0.5h | plotted image shows visible metrics and can be saved to file | ⬜ Not started | - |
| TRN 08 | E08.2 | Person B | notebook | Add before versus after evaluation on fixed seeds | SCN 11, TRN 05 | 1h | notebook compares baseline and trained policy on the same scenarios | ⬜ Not started | - |
| TRN 09 | E08.2 | Person B | `replicalab/agents/scientist_policy.py` | Add policy loading path for trained adapter or checkpoint | TRN 05 | 0.5h | evaluation can switch between baseline and trained model cleanly | ⬜ Not started | - |
| TRN 10 | E08.2 | Person B | docs | Export plot image and sample logs to `outputs/plots` | TRN 07 | 0.25h | plots are saved and versioned for README use | ⬜ Not started | - |
| TRN 11 | E08.1 | Person C | infra notes | Document environment URL, secrets, and connection troubleshooting | TRN 03 | 0.25h | any team member can run the notebook using the notes | ⬜ Not started | - |
| TRN 12 | E08.2 | Person D | storytelling | Convert evaluation results into two or three clear bullet insights for judges | TRN 08 | 0.5h | README and demo can state what improved in plain English | ⬜ Not started | - |
| TRN 13 | E08.1 | Person B | `replicalab/client.py` | Create reusable environment client module with `connect()`, `reset()`, `step()`, `close()` over REST and WebSocket | API 06 | 1h | client module can be imported by notebook and other consumers without duplicating connection logic | ⬜ Not started | - |
| TRN 14 | E08.1 | Person B | notebook or docs | Select and document base model for Scientist fine tuning with rationale for size, license, and structured output capability | TRN 01 | 0.5h | base model choice is documented and all team members know which model is being trained | ⬜ Not started | - |
| TRN 15 | E08.2 | Person B | notebook | Add agreement rate and invalid action rate aggregation to evaluation outputs and before versus after comparison | TRN 06, TRN 08, OBS 09 | 0.5h | notebook reports reward, rounds, agreement rate, and invalid action rate for baseline and trained runs | ⬜ Not started | - |
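
A reusable client in the TRN 13 spirit could look like this. It is a sketch under stated assumptions: the endpoint names and payload shapes mirror the hypothetical REST contract, and the transport is injectable so the notebook can test against a fake before pointing at a live server.

```python
from typing import Callable


class EnvClient:
    """Minimal REST-style client sketch for TRN 13.

    `post` is any callable (url, body) -> dict, e.g. a thin wrapper around
    `requests.post(url, json=body).json()` in the real notebook.
    """

    def __init__(self, base_url: str, post: Callable[[str, dict], dict]):
        self.base_url = base_url
        self.post = post
        self.session_id: str | None = None

    def reset(self, scenario: str, seed: int) -> dict:
        out = self.post(f"{self.base_url}/reset", {"scenario": scenario, "seed": seed})
        self.session_id = out["session_id"]
        return out["observation"]

    def step(self, action: dict) -> dict:
        return self.post(f"{self.base_url}/step",
                         {"session_id": self.session_id, "action": action})

    def close(self) -> None:
        self.session_id = None


# Fake transport standing in for a live server.
def fake_post(url: str, body: dict) -> dict:
    if url.endswith("/reset"):
        return {"session_id": "ep-001", "observation": {"round": 0}}
    return {"observation": {"round": 1}, "reward": 0.0, "done": False}


client = EnvClient("http://localhost:7860", fake_post)
obs = client.reset("dose_response", seed=42)
result = client.step({"type": "propose_protocol"})
```

Keeping the transport injectable is what makes the rollout loop (TRN 04) unit-testable without the hosted Space being up.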

---

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UI 01 | E09.1 | Person D | `frontend/src/App.tsx` | Create application shell with three panel layout | FND 03 | 0.75h | app renders layout for paper, conversation, and scoring panels | ⬜ Not started | - |
| UI 02 | E09.1 | Person D | `frontend/src/components/PaperPanel.tsx` | Build original paper summary panel | SCN 12 | 0.75h | panel displays title, hypothesis, method, key finding, and seed | ⬜ Not started | - |
| UI 03 | E09.1 | Person D | `frontend/src/components/ProtocolPanel.tsx` | Build current protocol and diff panel | JDG 09 | 1h | panel highlights current plan fields and updates after each round | ⬜ Not started | - |
| UI 04 | E09.1 | Person D | `frontend/src/components/NegotiationLog.tsx` | Build chat style negotiation log | API 03 or API 06 | 1h | scientist and lab manager messages show in correct order with role styling | ⬜ Not started | - |
| UI 05 | E09.1 | Person D | `frontend/src/components/ScorePanel.tsx` | Build rigor, feasibility, fidelity, and total score cards | JDG 09 | 0.75h | score cards render component values and penalties clearly | ⬜ Not started | - |
| UI 06 | E09.2 | Person D | `frontend/src/components/Controls.tsx` | Build new episode, seed input, scenario selector, and start controls | API 02, API 04 | 0.75h | user can start a chosen scenario with chosen seed from UI | ⬜ Not started | - |
| UI 07 | E09.2 | Person D | `frontend/src/lib/api.ts` | Add REST plus WebSocket client helpers | API 02 to API 06 | 0.75h | UI can connect locally and to the hosted Space | ⬜ Not started | - |
| UI 08 | E09.2 | Person D | `frontend/src/components/ReplayViewer.tsx` | Build replay viewer from completed episode logs | API 05 | 1h | user can load a past episode and step through rounds | ⬜ Not started | - |
| UI 09 | E09.1 | Person D | `frontend/src/components/TrainingResults.tsx` | Add before versus after panel or static result card | TRN 10 | 0.75h | UI can show reward curve image and summary metrics | ⬜ Not started | - |
| UI 10 | E09.1 | Person D | frontend styling | Add clean visual styling with Tailwind plus shadcn compatible primitives and responsive spacing | UI 01 to UI 09, FND 13 | 0.75h | UI is presentable on demo screen without layout breaks and styling stack matches the declared toolchain | ⬜ Not started | - |
| UI 11 | E09.2 | Person C | integration | Serve frontend with backend or configure proxy during dev | UI 07, API 01 | 0.5h | one command local dev works and deployed app serves UI path | ⬜ Not started | - |
| UI 12 | E09.2 | Person D | tests and smoke | Add smoke test checklist for core UI flow | UI 01 to UI 11 | 0.5h | checklist confirms new episode, step, score update, and replay all work | ⬜ Not started | - |
| UI 13 | E09.1 | Person D | `frontend/src/components/JudgeAuditPanel.tsx` or `NegotiationLog.tsx` | Render final Judge audit text and verdict at episode end | JDG 11, API 18 | 0.75h | UI shows a clear end of episode audit without hiding the deterministic score breakdown | ⬜ Not started | - |
| UI 14 | E09.2 | Person D | `frontend/src/components/ReplayViewer.tsx` | Add replay slider or scrubber so judges can move across rounds quickly | UI 08 | 0.5h | user can scrub to any round without replaying the full episode sequentially | ⬜ Not started | - |
| UI 15 | E09.1 | Person D | `frontend/src/components/TrainingResults.tsx` and `Controls.tsx` | Add before versus after training toggle for baseline versus trained views in the demo UI | UI 06, UI 09, TRN 15 | 0.5h | judges can switch between baseline and trained result summaries from the UI | ⬜ Not started | - |

---

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OBS 01 | E10.1 | Person C | `replicalab/utils/logging.py` | Standardize episode log schema for transcript, state snapshots, and scores | ENV 09 | 0.5h | every completed episode log contains the same required fields | ⬜ Not started | - |
| OBS 02 | E10.1 | Person C | logging config | Add local log levels and readable console formatting | API 01 | 0.5h | debug logs can be toggled without code edits | ⬜ Not started | - |
| OBS 03 | E10.1 | Person C | replay utilities | Add episode id generation and file naming conventions | OBS 01 | 0.25h | logs never overwrite and are easy to locate | ⬜ Not started | - |
| OBS 04 | E10.2 | Person A | tests | Add deterministic replay test using seed and action sequence | ENV 10 | 0.75h | replay of same seed and actions matches prior state sequence | ⬜ Not started | - |
| OBS 05 | E10.2 | Person D | UI | Surface episode id and replay link in UI | API 05, UI 08 | 0.5h | user can easily capture or revisit a past episode | ⬜ Not started | - |
| OBS 06 | E10.1 | Person B | notebook | Log training run metadata including model, seed, scenario set, steps | TRN 06 | 0.5h | notebook exports metadata with each run for reproducibility | ⬜ Not started | - |
| OBS 07 | E10.1 | Person C | scripts | Add simple local script to run one episode and dump logs | ENV 06, OBS 01 | 0.5h | one command produces a complete local sample log | ⬜ Not started | - |
| OBS 08 | E10.2 | Person D | storytelling | Create static replay screenshots or gifs for README and video | UI 08 | 0.5h | at least two crisp visual assets are ready for docs and demo | ⬜ Not started | - |
| OBS 09 | E10.1 | Person C | `replicalab/utils/logging.py` | Extend episode summary schema with `judge_notes`, `agreement`, `invalid_action_count`, and `invalid_action_rate` for replay and evaluation consumers | OBS 01, JDG 11, ENV 11 | 0.5h | every completed episode log contains the audit payload plus demo and evaluation metrics needed by notebook, UI, and README | ⬜ Not started | - |
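
The OBS 01, OBS 03, and OBS 09 pieces fit together as one summary builder. This sketch assumes the field names listed in the tasks above; the id format (timestamp plus short uuid) is one plausible convention that satisfies "logs never overwrite".

```python
import json
import time
import uuid


def new_episode_id() -> str:
    # OBS 03: timestamp plus a short uuid suffix so filenames never collide.
    return f"ep_{time.strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:8]}"


def episode_summary(transcript, scores, judge_notes, agreement,
                    invalid_action_count, rounds):
    """Assumed summary schema covering OBS 01 plus the OBS 09 extensions."""
    steps = max(rounds, 1)  # guard against division by zero on empty episodes
    return {
        "episode_id": new_episode_id(),
        "transcript": transcript,
        "scores": scores,  # rigor, feasibility, fidelity, total
        "judge_notes": judge_notes,
        "agreement": agreement,
        "invalid_action_count": invalid_action_count,
        "invalid_action_rate": invalid_action_count / steps,
    }


summary = episode_summary(
    transcript=[{"role": "scientist", "text": "propose n=30 per arm"}],
    scores={"rigor": 0.9, "feasibility": 0.8, "fidelity": 0.7, "total": 0.8},
    judge_notes=["sample size matches original"],
    agreement=True,
    invalid_action_count=1,
    rounds=8,
)
json.dumps(summary)  # schema must stay JSON-serializable for replay consumers
```

Computing `invalid_action_rate` at log time keeps the notebook (TRN 15) and UI reading the same number instead of each deriving it independently.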

---

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TST 01 | E11.1 | Person A | `tests/test_env.py` | Add reset returns valid observations test | ENV 02 | 0.5h | test confirms both roles receive valid structured observations | ⬜ Not started | - |
| TST 02 | E11.1 | Person A | `tests/test_env.py` | Add valid action step test | ENV 03 to ENV 06 | 0.5h | valid action advances round and returns correct shape | ⬜ Not started | - |
| TST 03 | E11.1 | Person A | `tests/test_env.py` | Add invalid action handling test | MOD 05, ENV 03 | 0.5h | invalid action yields structured error and environment survives | ⬜ Not started | - |
| TST 04 | E11.1 | Person A | `tests/test_reward.py` | Add perfect protocol high reward test | JDG 04 | 0.5h | perfect protocol scores higher than baseline and broken protocol | ⬜ Not started | - |
| TST 05 | E11.1 | Person A | `tests/test_reward.py` | Add zero dimension or penalty behavior test | JDG 04 | 0.5h | zero feasibility or timeout lowers reward as expected | ⬜ Not started | - |
| TST 06 | E11.1 | Person C | `tests/test_server.py` | Add health plus reset plus step endpoint tests | API 01 to API 03 | 0.75h | API tests pass locally | ⬜ Not started | - |
| TST 07 | E11.1 | Person C | `tests/test_server.py` | Add WebSocket connection and invalid payload tests | API 06 | 0.75h | WebSocket errors are graceful and session stays isolated | ⬜ Not started | - |
| TST 08 | E11.2 | Person D | manual checklist | Create demo smoke checklist for local and hosted builds | UI 12, API 10 | 0.5h | team can verify full demo in under five minutes | ⬜ Not started | - |
| TST 09 | E11.2 | Person B | notebook checklist | Create notebook smoke test for fresh runtime | TRN 12 | 0.5h | training notebook runs from top with minimal edits | ⬜ Not started | - |
| TST 10 | E11.2 | all | full run | Execute one integrated test pass before freeze | all prior TST tasks | 1h | environment, UI, Space, and notebook all pass their smoke tests the same day | ⬜ Not started | - |
| TST 11 | E11.1 | Person C | `tests/test_server.py` and `tests/test_env.py` | Add contract tests for judge audit payloads and invalid action metrics in terminal responses and replay logs | API 18, OBS 09 | 0.75h | tests confirm terminal payloads and replay files expose audit notes, agreement, and invalid action metrics | ⬜ Not started | - |
| TST 12 | E11.2 | Person D | manual checklist | Add fallback `/web` smoke step plus replay slider and before versus after toggle checks to demo checklist | API 19, UI 14, UI 15 | 0.5h | checklist verifies custom UI path and fallback UI path are both demo ready | ⬜ Not started | - |
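
The deterministic replay check behind ENV 10 and OBS 04 has a simple shape, sketched here against a stand-in environment (the real `ReplicaLabEnv` would replace `FakeEnv`; its interface here is an assumption).

```python
import random


class FakeEnv:
    """Stand-in environment with seeded randomness, for illustration only."""

    def __init__(self, seed: int):
        self.rng = random.Random(seed)  # all randomness flows from one seed
        self.state = 0

    def reset(self) -> int:
        self.state = self.rng.randint(0, 100)
        return self.state

    def step(self, action: int) -> int:
        self.state = (self.state + action + self.rng.randint(0, 10)) % 101
        return self.state


def rollout(seed: int, actions: list[int]) -> list[int]:
    """Record the full state sequence for a seed and action list."""
    env = FakeEnv(seed)
    states = [env.reset()]
    states += [env.step(a) for a in actions]
    return states


def test_deterministic_replay():
    # Same seed plus same actions must reproduce the same state sequence.
    actions = [1, 2, 3, 4]
    assert rollout(seed=42, actions=actions) == rollout(seed=42, actions=actions)


test_deterministic_replay()
```

In the real suite this pattern doubles as the replay-consistency contract: the logged state sequence from an episode file should equal a fresh rollout of its recorded seed and actions.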

---

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DOC 01 | E12.1 | Person D | `README.md` | Write hook, problem statement, and one line product summary | FND 06 | 0.75h | README opening clearly explains the replication crisis and ReplicaLab solution | ⬜ Not started | - |
| DOC 02 | E12.1 | Person D | `README.md` | Add architecture diagram and environment loop explanation | ENV 06, API 10 | 1h | diagram matches actual code and can be understood in under ten seconds | ⬜ Not started | - |
| DOC 03 | E12.1 | Person D | `README.md` | Add setup instructions for local run, Docker, HF Space, and Colab | API 10, TRN 11 | 0.75h | new user can follow setup without asking the team for hidden steps | ⬜ Not started | - |
| DOC 04 | E12.1 | Person D | `README.md` | Add results section with reward curve and before versus after comparison | TRN 10, TRN 12 | 0.75h | README includes at least one figure and one concrete improvement statement | ⬜ Not started | - |
| DOC 05 | E12.2 | Person D | demo script | Write one minute demo script with time coded scenes | UI 10, TRN 12 | 0.5h | demo script fits within one minute and covers problem, environment, and result | ⬜ Not started | - |
| DOC 06 | E12.2 | Person D | demo assets | Capture screen recording clips and narration or captions | DOC 05 | 1h | raw footage covers all key scenes and is visually clear | ⬜ Not started | - |
| DOC 07 | E12.2 | Person D | final video | Edit and upload final one minute YouTube demo | DOC 06 | 1h | video is public or unlisted, shareable, and under the time limit | ⬜ Not started | - |
| DOC 08 | E12.2 | Person C | repo hygiene | Verify repo is public and all required files are committed | API 10, UI 10, TRN 10 | 0.25h | public repo contains code, notebook, docs, and no secret leakage | ⬜ Not started | - |
| DOC 09 | E12.2 | all | submission form prep | Prepare final submission links and partner track selections | DOC 07, DOC 08 | 0.5h | all submission fields have final links and verified accessibility | ⬜ Not started | - |
| DOC 10 | E12.2 | all | dry run | Run final three minute pitch plus two minute Q and A rehearsal | DOC 09 | 0.75h | team can explain tracks, reward, architecture, and results confidently | ⬜ Not started | - |
| DOC 11 | E12.1 | Person D | `README.md` | Add evaluation summary table for average reward, rounds to agreement, invalid action rate, agreement rate, and note the `/web` fallback route as backup demo path | DOC 03, DOC 04, TRN 15, API 19 | 0.5h | README results and setup sections reflect all promised metrics and clearly document the fallback demo route | ⬜ Not started | - |

---

frontend/.gitkeep ADDED
frontend/src/.gitkeep ADDED
frontend/src/components/.gitkeep ADDED
frontend/src/pages/.gitkeep ADDED
notebooks/.gitkeep ADDED
replicalab/.gitkeep ADDED
replicalab/agents/.gitkeep ADDED
replicalab/outputs/.gitkeep ADDED
replicalab/outputs/logs/.gitkeep ADDED
replicalab/outputs/plots/.gitkeep ADDED
replicalab/outputs/replays/.gitkeep ADDED
replicalab/prompts/.gitkeep ADDED
replicalab/scenarios/.gitkeep ADDED
replicalab/scoring/.gitkeep ADDED
replicalab/utils/.gitkeep ADDED
server/.gitkeep ADDED
tests/.gitkeep ADDED