Spaces:

openenv-community
/

replicalab

Running

App Files Files Community

replicalab / ReplicaLab_Comprehensive_Task_Division.md

maxxie114

Initial HF Spaces deployment

80d8c84 2 days ago

preview code

raw

history blame contribute delete

81.7 kB



	# ReplicaLab Comprehensive Task Division and Delivery Backlog

	## 1. Document purpose

	This document is the working blueprint for building ReplicaLab in a hackathon setting with a 4 person team. It is written like a lightweight real world delivery plan with:

	1. Product scope
	2. Team ownership
	3. Module and function ownership
	4. Epics
	5. User stories
	6. Lowest level tasks
	7. Dependencies
	8. Acceptance criteria
	9. Delivery workflow
	10. Definition of done

	The goal is to let any team member pick up work immediately without confusion.

	---

	## 2. Product summary

	ReplicaLab is an OpenEnv environment where a Scientist agent and a Lab Manager agent negotiate how to solve a constrained technical task under real world limits such as budget, tools, compute, schedule, stock, and staffing.

	The environment is used to train the Scientist agent with reinforcement learning so it learns to ask better questions, preserve objective quality, use bounded evidence tools correctly, and produce more feasible plans under domain-specific constraints.

	The first domain focus is:

	1. Mathematics
	2. Machine learning
	3. Finance and trading design in offline or backtest form

	Physics and biology remain follow-on adapters once the normalized scenario layer is stable.

	### The judged MVP outcome

	By judging time, the project should demonstrate:

	1. A working OpenEnv environment deployed on Hugging Face Spaces on port `7860`
	2. At least one full scenario family working end to end, with a target of three
	3. A Scientist agent that can interact with the environment through structured actions and bounded evidence tools
	4. A hybrid model-backed Lab Manager with deterministic feasibility grounding and bounded validation tools
	5. A deterministic judge and reward engine
	6. A Colab training notebook plus reusable H100 job path using Unsloth or HF TRL
	7. A reward curve showing improvement
	8. A public GitHub repository
	9. A one minute YouTube demo
	10. A README with architecture, setup, and results

	---

	## 3. Scope control

	## 3.1 In scope for the hackathon MVP

	1. OpenEnv environment implementation
	2. FastAPI and WebSocket serving
	3. Hugging Face Docker Space deployment
	4. Scientist agent with structured JSON action output plus bounded search, code-check, and image-inspection capability
	5. Hybrid model-backed Lab Manager grounded by deterministic feasibility checks plus bounded validation tools
	6. Judge rubric engine with deterministic scoring
	7. Three scenario families for MVP
	1. Mathematics reasoning and proof planning
	2. ML benchmark replication
	3. Finance or trading backtest planning
	8. Frozen evidence packs for deterministic training plus limited live validation during demo or eval
	9. Reward logging
	10. Replay logs
	11. Colab RL notebook
	12. Reward curve image
	13. Thin React plus Vite frontend or OpenEnv `/web` fallback
	14. README, demo video, submission package

	## 3.2 Out of scope for the hackathon MVP

	1. Proving whether a real research paper is globally true or false
	2. Unrestricted parsing of arbitrary live internet content inside the training loop
	3. Real wet lab execution
	4. Live trading or production finance execution
	5. Real time collaboration features
	6. Training both Scientist and Lab Manager in self play
	7. Open-ended autonomous coding outside a bounded verification or analysis sandbox
	8. Image generation or audio capabilities in the agent policy loop
	9. Complex third party enterprise integrations
	10. Full multi-domain rollout unless time remains
	11. Manager-led subagent orchestration unless the MVP is already stable

	---

	## 4. Team structure and role ownership

	\| Role \| Owner focus \| Primary responsibilities \| Secondary responsibilities \|
	\| --- \| --- \| --- \| --- \|
	\| Person A \| Environment and Scoring Lead \| scenario engine, constraint logic, reward logic, state transitions, tests \| supports judge audit text \|
	\| Person B \| RL and Agent Lead \| Scientist prompting, action schemas, training loop, rollouts, evaluation, reward curves \| supports lab manager templating \|
	\| Person C \| Backend and Infra Lead \| FastAPI server, WebSocket handling, Docker, HF Space deploy, logs, replay endpoints \| supports local dev scripts \|
	\| Person D \| Frontend and Storytelling Lead \| React plus Vite UI, live negotiation display, replay viewer, README, demo flow, video assets \| supports final integration testing \|

	### Shared responsibilities

	\| Shared area \| Expectation \|
	\| --- \| --- \|
	\| Git hygiene \| every feature goes through branch plus PR \|
	\| Integration \| merge to main only after quick smoke test \|
	\| Testing \| each owner writes tests for their workstream \|
	\| Storytelling \| everyone contributes screenshots, gifs, examples \|
	\| Submission readiness \| all four review final demo, notebook, README, repo visibility \|

	## 4.1 Training compute and model selection

	1. The team has access to an H100 GPU for heavier Scientist and Lab Manager training and evaluation runs.
	2. Person B is the primary owner of that compute for RL tasks, especially `TRN 04` to `TRN 10`, `TRN 13` to `TRN 15`, `OBS 06`, and `TST 09`.
	3. The judged artifact remains the Colab notebook, but the primary heavy-runtime path is now a Northflank H100 GPU job with persistent volume checkpoints and caches.
	4. Person C supports any environment URL, secret, volume, or infra setup needed so the H100 training run can connect to the same backend contract as the notebook.

	### Trainable model

	The primary shared base model for the current training iteration is
	Qwen3.5-9B.

	\| Model \| Role \| Rationale \|
	\| --- \| --- \| --- \|
	\| Qwen3.5-9B \| Primary shared base for Scientist and Lab Manager adapters \| Fits the Northflank H100 plan, upgrades the repo from the older Qwen3-8B baseline, and keeps both trainable role artifacts on one model family. \|
	\| Qwen3.5-4B \| Reduced-scale fallback \| Use for Colab or lower-memory debug runs when faster iteration matters more than final V2 quality. \|
	\| Qwen3.5-122B-A10B \| Audit-only judge candidate \| Useful for qualitative post-run analysis, but not part of the deterministic training reward loop. \|

	### Evaluator layer

	The training reward is always the deterministic rubric engine defined in E05. Anthropic is the active hosted oracle provider for post-episode explanation, scenario enrichment, and demo audit only. The frontier evaluator is never part of the training reward loop.

	### MVP role implementations

	\| Role \| MVP implementation \| Future stretch \|
	\| --- \| --- \| --- \|
	\| Scientist \| Trainable GRPO policy (`Qwen3.5-9B` + LoRA) \| Larger model distillation or curriculum extensions \|
	\| Lab Manager \| Deterministically grounded role with a trainable SFT adapter on `Qwen3.5-9B` \| Manager orchestrator with specialist subagents and richer role-specific adapters \|
	\| Judge (training reward) \| Deterministic rubric engine \| Unchanged \|
	\| Judge (explanation layer) \| Optional large-model audit layer such as `Qwen3.5-122B-A10B` or Anthropic \| Extended explanation panel in UI \|

	## 4.2 Domain roadmap and normalized scenario layer

	The frozen outer action and observation contract from `FND 08`, `MOD 01`, `MOD 02`, and `MOD 03` remains stable. Domain expansion happens below that contract through a normalized scenario layer.

	The internal data flow is:

	`scenario adapter -> normalized scenario pack -> observation mapper -> ScientistObservation or LabManagerObservation`

	Every scenario family must emit the same normalized scenario pack with, at minimum:

	1. `domain_id`
	2. `task_summary`
	3. `success_criteria`
	4. `constraints`
	5. `resources`
	6. `allowed_substitutions`
	7. `hidden_reference_spec`
	8. `scenario_id`
	9. `seed`

	Rules for the normalized scenario layer:

	1. Domain-specific logic belongs in thin adapters, not in prompts or reward code.
	2. Prompts must be assembled from the normalized scenario pack, not hard-coded to one domain.
	3. Difficulty and curriculum changes should mechanically alter constraints, resources, or conflicts rather than fork separate prompt logic.
	4. The deterministic scorer compares the final agreed plan against `hidden_reference_spec`; model-backed roles never own truth.

	For the bounded-tool MVP, pending scenario and environment work will extend the
	existing normalized scenario pack with additive evidence fields. This is an
	extension below the frozen outer contract, not a reopening of `FND 08`,
	`MOD 01`, `MOD 02`, or `MOD 03`.

	Tool-capable scenario extensions:

	1. `evidence_pack`
	2. `artifact_refs`
	3. `allowed_tools`
	4. `tool_budget`
	5. `validation_policy`

	## 4.3 Bounded tool capability policy

	The richer-capability MVP keeps the final outward action contract stable while
	adding bounded tools below it.

	### Scientist allowed capabilities

	1. `search_evidence`
	- retrieve supporting facts, benchmark rules, paper details, or official references
	- not a reward source
	2. `run_code_check`
	- bounded code or config analysis, metric checks, value generation, runtime or cost estimation
	3. `inspect_image`
	- read tables, plots, figures, screenshots, and charts for evidence extraction

	### Lab Manager allowed capabilities

	1. `search_resources`
	- retrieve resource, policy, benchmark, or documentation constraints
	2. `run_code_check`
	- validate cost, runtime, config, reproducibility, or execution assumptions
	3. `inspect_image`
	- inspect figures, charts, and screenshots relevant to feasibility or policy review

	### Judge capability rules

	1. The judge reward remains deterministic and must not depend on live web search.
	2. Tool traces and evidence references may inform deterministic penalties, bonuses, or audit text.
	3. The judge may use bounded evidence verification for demo or audit text, but never as the training reward source.

	### Training and demo rules

	1. Training uses frozen evidence packs and deterministic tool traces whenever possible.
	2. Live web search is limited to demo-time or eval-time validation, not the core training reward loop.
	3. Image generation and audio are excluded from the policy loop for the hackathon MVP.
	4. Coding capability must stay sandboxed and task-scoped rather than open-ended.

	---

	## 5. Module and function ownership map

	\| Module or file \| Key functions or classes \| Owner \| Notes \|
	\| --- \| --- \| --- \| --- \|
	\| `replicalab/models.py` \| `ScientistAction`, `LabManagerAction`, `Observation`, `StepResult`, `EpisodeState`, `EpisodeLog` \| Person A with Person B \| shared contract file \|
	\| `replicalab/scenarios/templates.py` \| `generate_scenario()`, `load_template()`, `apply_difficulty()`, `seed_rng()` \| Person A \| central normalized scenario factory and mapper inputs \|
	\| `replicalab/scenarios/math_reasoning.py` \| `build_math_reasoning_template()` \| Person A \| first structured reasoning scenario \|
	\| `replicalab/scenarios/ml_benchmark.py` \| `build_ml_benchmark_template()` \| Person A \| first reproducible compute scenario \|
	\| `replicalab/scenarios/finance_trading.py` \| `build_finance_trading_template()` \| Person A \| offline strategy and backtest planning only \|
	\| `replicalab/agents/scientist_policy.py` \| `build_scientist_prompt()`, `parse_scientist_output()` \| Person B \| trainable role \|
	\| `replicalab/agents/lab_manager_policy.py` \| `generate_lab_manager_response()`, `check_feasibility()` \| Person B with Person A \| model-backed negotiation grounded by deterministic checker \|
	\| `replicalab/agents/judge_policy.py` \| `explain_judgement()` optional only \| Person A \| explanation layer only \|
	\| `replicalab/tools/search.py` \| `search_evidence()`, `search_resources()` \| Person B with Person C \| bounded retrieval and validation only \|
	\| `replicalab/tools/code_tools.py` \| `run_code_check()` \| Person B \| bounded code analysis, config checks, and derived-value generation \|
	\| `replicalab/tools/image_tools.py` \| `inspect_image()` \| Person B with Person D \| bounded table, chart, figure, and screenshot inspection \|
	\| `replicalab/scoring/rigor.py` \| `score_rigor()` \| Person A \| deterministic \|
	\| `replicalab/scoring/feasibility.py` \| `score_feasibility()` \| Person A \| deterministic \|
	\| `replicalab/scoring/fidelity.py` \| `score_fidelity()` \| Person A \| deterministic \|
	\| `replicalab/scoring/rubric.py` \| `compute_total_reward()`, `build_reward_breakdown()` \| Person A \| core reward \|
	\| `replicalab/utils/validation.py` \| `validate_protocol()`, `validate_vocab()` \| Person A \| schema and semantic checks \|
	\| `replicalab/utils/logging.py` \| `write_episode_log()`, `write_reward_csv()` \| Person C \| logging helpers \|
	\| `replicalab/env/replicalab_env.py` \| `ReplicaLabEnv.reset()`, `step()`, `state()`, `close()` \| Person A \| OpenEnv environment \|
	\| `server/app.py` \| `create_app()`, REST routes, WebSocket handler \| Person C \| runtime entrypoint \|
	\| `server/Dockerfile` \| build and run app \| Person C \| deployment \|
	\| `frontend/src/App.tsx` \| app shell \| Person D \| UI root \|
	\| `frontend/src/components/*` \| paper panel, log panel, score panel, controls, replay, judge audit \| Person D \| UI components \|
	\| `frontend/vite.config.ts` \| dev proxy and build output config \| Person C with Person D \| frontend and backend integration \|
	\| `frontend/tailwind.config.ts` and `frontend/postcss.config.js` \| theme tokens and CSS pipeline \| Person D \| matches declared styling stack \|
	\| `notebooks/train_colab.ipynb` \| setup, connect, rollout, train, plot \| Person B \| judged asset \|
	\| `replicalab/training/*.py` \| reusable dataset, GRPO, SFT, evaluation, plotting, and job-entrypoint helpers \| Person B \| shared by notebook, Northflank H100 jobs, and evaluation scripts \|
	\| `tests/*` \| unit and integration tests \| all \| each owner covers own modules \|
	\| `openenv.yaml` \| environment registration and server config \| Person A \| required for OpenEnv discovery \|
	\| `replicalab/config.py` \| `MAX_ROUNDS`, `DEFAULT_DIFFICULTY`, `TIMEOUT_SECONDS`, `MAX_BUDGET` \| Person A \| single source of truth for constants \|
	\| `replicalab/client.py` \| `ReplicaLabClient.connect()`, `reset()`, `step()`, `close()` \| Person B \| reusable by notebook and external consumers \|
	\| `replicalab/utils/seed.py` \| `seed_rng()`, `get_deterministic_seed()` \| Person A \| shared by scenarios and env \|
	\| `replicalab/prompts/*.txt` \| role prompt templates \| Person B \| loadable domain-neutral text files assembled from normalized scenario data \|
	\| `replicalab/outputs/` \| `logs/`, `replays/`, `plots/` \| Person C \| gitignored output directories \|
	\| `server/requirements.txt` \| pinned runtime dependencies \| Person C \| standalone server install \|
	\| `README.md` \| project story, setup, results \| Person D with all \| judged asset \|

	---

	## 6. Delivery phases

	\| Phase \| Goal \| Exit condition \|
	\| --- \| --- \| --- \|
	\| Phase 0 \| contracts and scaffolding \| repo, schema, branch rules, basic app skeleton \|
	\| Phase 1 \| one working scenario end to end \| reset, step, reward, logs work locally \|
	\| Phase 2 \| deployable environment \| FastAPI, Docker, HF Space live \|
	\| Phase 3 \| trainable loop \| Colab notebook connects and shows non flat rewards \|
	\| Phase 4 \| compelling demo \| UI, replay, reward breakdown, README, video \|
	\| Phase 5 \| hardening \| smoke tests, bug fixes, final submission review \|

	---

	## 7. Operating workflow

	## 7.1 Branching model

	\| Branch type \| Example \| Rule \|
	\| --- \| --- \| --- \|
	\| main \| `main` \| always demo safe \|
	\| feature \| `feature/env-reset-loop` \| one feature per branch \|
	\| hotfix \| `hotfix/ws-timeout-fix` \| used only for urgent breaks \|

	## 7.2 PR checklist

	Every PR must include:

	1. linked task ID
	2. summary of change
	3. screenshots or logs if UI or environment behavior changed
	4. quick test result
	5. note on any schema or API changes

	## 7.3 Integration cadence

	1. Sync at start of day
	2. Merge every 2 to 3 hours if stable
	3. End of block smoke test on:
	1. local reset
	2. one full episode
	3. frontend load
	4. notebook connection if applicable

	---

	## 8. Epic backlog

	### Status legend

	- `✅ Completed`
	- `❌ Failed`
	- `🟡 Partial`
	- `⬜ Not started`
	- `Completed by`: fill this only when the finisher is different from the assigned owner; otherwise use `—`

	---

	## Epic E01. Foundations and repository setup

	### Epic goal
	Create a stable shared codebase, contracts, and development workflow so all workstreams can proceed in parallel.

	### Current status

	- `FND 01` status: completed on 2026-03-07
	- `FND 01` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
	- `FND 02` status: completed on 2026-03-08
	- `FND 02` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
	- `FND 04` status: completed on 2026-03-08
	- `FND 04` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
	- `FND 05` status: completed on 2026-03-08
	- `FND 05` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
	- `FND 06` status: completed on 2026-03-08
	- `FND 06` completed by: `Person B (Ayush)` while the assigned owner remains `Person D`
	- `FND 07` status: completed on 2026-03-08
	- `FND 07` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
	- `FND 08` status: completed on 2026-03-08
	- `FND 08` completed by: `Person A (Kian)` and `Person B (Ayush)` with shared sign-off recorded in `docs/fnd08_frozen_json_contract.md`
	- `FND 09` status: completed on 2026-03-08
	- `FND 09` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
	- `FND 11` status: completed on 2026-03-08
	- `FND 11` completed by: `Max (Person C)`; the branch import and standards validation were handled by `Person B (Ayush)`
	- `FND 10` status: completed on 2026-03-07
	- `FND 10` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
	- Completed scope for `FND 01`: created the agreed repo scaffold for `replicalab/`, `server/`, `frontend/`, `notebooks/`, and `tests/`, including the initial `replicalab/` and `frontend/src/` subfolders from the planned layout
	- Completed scope for `FND 02`: added `pyproject.toml` with package metadata, Python version floor, runtime dependencies, dev extras, and basic pytest discovery settings; verified editable install and shared-model imports
	- Completed scope for `FND 04`: added importable empty Pydantic model stubs in `replicalab/models.py` for the shared action, observation, step, state, and log contracts
	- Completed scope for `FND 05`: created `.dockerignore` and expanded `.gitignore` to cover Python, Node, notebook, coverage, cache, and generated output artifacts while preserving tracked `.gitkeep` scaffold files
	- Completed scope for `FND 06`: replaced the aspirational README with a temporary foundation stub that reflects the actual repo state, mission, team ownership, and current local setup placeholder
	- Completed scope for `FND 07`: added GitHub PR and task-issue templates and tightened the repo workflow rules for branch naming and required tracking-doc updates
	- Completed scope for `FND 08`: added `docs/fnd08_frozen_json_contract.md` with field semantics, enums, nested object schemas, null-vs-empty rules, canonical JSON examples for all 8 shared models, and final shared sign-off
	- Completed scope for `FND 09`: added `openenv.yaml` with OpenEnv manifest metadata plus the minimal repo wiring required for local OpenEnv validation (`openenv-core` dependency, `server` script entry point, `uv.lock`, and `server.app.main()`)
	- Completed scope for `FND 10`: created `replicalab/outputs/` with tracked `logs/`, `replays/`, and `plots/` subdirectories
	- Completed scope for `FND 11`: added `server/requirements.txt` with standalone runtime dependency pins and verified installation from that file
	- Completed scope for `FND 03`: imported the full React plus Vite frontend tree from Kush's branch onto `ayush`, including the app shell, pages, shared components, assets, and TypeScript config, and validated it with `npm --prefix frontend install` plus `npm --prefix frontend run build`
	- Completed scope for `FND 12`: imported `frontend/vite.config.ts` with local `/api` and `/ws` proxy support plus stable Vite build settings and validated the build on `ayush`
	- Backend and deployment scope imported from Max's PR has now been normalized onto the current standards, validated against the real env, Docker-verified locally, and extended with HF Spaces metadata plus deployment instructions
	- Newly unblocked by `FND 08`: `MOD 01`, `MOD 02`, `MOD 03`, `MOD 12`, `SCN 01`
	- Newly unblocked by `FND 06`: `DOC 01`
	- Newly unblocked by `FND 03`: `FND 13`, `UI 01`
	- Remaining Epic E01 work still gated by follow-on dependencies: `FND 13`
	- Remaining completion items for the backend and deployment path: live HF Space bring-up (`API 10`), secrets documentation (`API 17`), replay persistence, and the remaining partial API polish tasks
	- Completed scope for `SCN 01` to `SCN 10`: added deterministic seed utilities, normalized scenario-pack models, math / ML / finance template builders, difficulty scaling, hidden reference specs, allowed substitutions, and seeded scenario tests
	- Completed scope for `SCN 11`: added three fixed golden scenarios for deterministic prompt and manual checks under `tests/fixtures/golden_scenarios.json`
	- Completed scope for `AGT 01`: added a domain-neutral Scientist system prompt builder that renders role instructions, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON output contract from normalized scenario data
	- Newly unblocked by `SCN 11` and `AGT 01`: `AGT 02`, `AGT 11`, `TRN 04`, `TRN 08`
	- Remaining Epic E03 work after the scenario bundle: `SCN 12`

	### User stories

	US E01.1
	As a developer, I want a clean repo and file layout so I can build without stepping on other people’s work.

	US E01.2
	As a team, we want agreed schemas and coding rules so integration risk stays low.

	### Tasks

	\| ID \| Story \| Owner \| Module or file \| Task \| Depends on \| Estimate \| Acceptance criteria \| Status \| Completed by \|
	\| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \|
	\| FND 01 \| E01.1 \| Person C \| repo root \| Create repo structure and base folders from agreed layout \| none \| 0.5h \| all top level folders exist and repo clones cleanly \| ✅ Completed \| Person B (Ayush) \|
	\| FND 02 \| E01.1 \| Person C \| `pyproject.toml` \| Add Python project config and dependencies placeholder \| FND 01 \| 0.5h \| project installs locally without missing package errors for base modules \| ✅ Completed \| Person B (Ayush) \|
	\| FND 03 \| E01.1 \| Person C \| `frontend/package.json` \| Initialize React plus Vite frontend shell \| FND 01 \| 0.5h \| `npm install` and dev server run successfully \| ✅ Completed \| Kush \|
	\| FND 04 \| E01.2 \| Person A \| `replicalab/models.py` \| Add empty Pydantic models and shared type names \| FND 01 \| 0.5h \| import paths resolve for all placeholder models \| ✅ Completed \| Person B (Ayush) \|
	\| FND 05 \| E01.2 \| Person C \| `.gitignore` and `.dockerignore` \| Add ignore rules for Python, Node, logs, notebooks, and build artifacts. `.dockerignore` must explicitly exclude `.git`, `node_modules`, `notebooks/`, `tests/`, `__pycache__`, `.venv`, and output files to keep the Docker image lean \| FND 01 \| 0.25h \| repo status stays clean after local run and build, and Docker build excludes non-runtime files \| ✅ Completed \| Person B (Ayush) \|
	\| FND 06 \| E01.2 \| Person D \| `README.md` \| Add temporary project stub with title, mission, team roles, and local setup placeholder \| FND 01 \| 0.5h \| new contributor can understand repo purpose in under two minutes \| ✅ Completed \| Person B (Ayush) \|
	\| FND 07 \| E01.2 \| Person C \| repo settings \| Define branch naming, PR template, and issue template \| FND 01 \| 0.5h \| all future PRs auto show the template and issue fields \| ✅ Completed \| Person B (Ayush) \|
	\| FND 08 \| E01.2 \| Person A and B \| docs or backlog file \| Freeze JSON contract for actions and observations \| FND 04 \| 0.75h \| all owners sign off and no blocking contract ambiguity remains \| ✅ Completed \| Person A (Kian) and Person B (Ayush) \|
	\| FND 09 \| E01.2 \| Person A \| `openenv.yaml` \| Create OpenEnv configuration file specifying environment class, action and observation types, and server settings \| FND 04 \| 0.5h \| OpenEnv can discover and serve the environment using this config file \| ✅ Completed \| Person B (Ayush) \|
	\| FND 10 \| E01.1 \| Person C \| `replicalab/outputs/` \| Create output directory structure with `logs/`, `replays/`, and `plots/` subdirectories and add to gitignore \| FND 01 \| 0.25h \| output directories exist and generated files are not committed to git \| ✅ Completed \| Person B (Ayush) \|
	\| FND 11 \| E01.1 \| Person C \| `server/requirements.txt` \| Create server requirements file pinning FastAPI, uvicorn, websockets, and other runtime dependencies \| FND 02 \| 0.25h \| server can be installed from requirements.txt independently of pyproject.toml \| ✅ Completed \| Max (Person C) \|
	\| FND 12 \| E01.1 \| Person C \| `frontend/vite.config.ts` \| Create Vite config with API and WebSocket proxy support for local development plus stable build output settings \| FND 03 \| 0.5h \| frontend dev server can reach backend without manual URL edits and build output is predictable for Docker packaging \| ✅ Completed \| Kush \|
	\| FND 13 \| E01.1 \| Person D \| `frontend/tailwind.config.ts` and `frontend/postcss.config.js` \| Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles \| FND 03 \| 0.75h \| frontend can use Tailwind utilities and shared shadcn compatible theme tokens without CSS pipeline errors \| ✅ Completed \| Kush (Tailwind v4.2 with @theme CSS vars, cva+clsx, light/dark mode) \|

	---

	## Epic E02. Domain models, validation, and state contracts

	### Epic goal
	Define the environment contracts cleanly so state, actions, and observations are deterministic and easy to train against.

	### Current status

	- `MOD 01` status: completed on 2026-03-08
	- `MOD 01` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
	- `MOD 02` status: completed on 2026-03-08
	- `MOD 02` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
	- `MOD 03` status: completed on 2026-03-08
	- `MOD 03` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
	- `MOD 04` status: completed on 2026-03-08
	- `MOD 04` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
	- `MOD 05` status: completed on 2026-03-08
	- `MOD 05` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
	- `MOD 11` status: completed on 2026-03-08
	- `MOD 11` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
	- `MOD 12` status: completed on 2026-03-08
	- `MOD 12` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
	- `MOD 09` status: completed on 2026-03-08
	- Completed scope for `MOD 01`: replaced the placeholder `ScientistAction` with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, rejected mixed-mode payloads, added conditional validation for proposal, revision, request-info, and accept modes, added focused schema tests, and patched the stub server so `accept` no longer overwrites the current protocol with default values
	- Completed scope for `MOD 02`: replaced the placeholder `LabManagerAction` with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, enforced feasible-flag consistency across budget, equipment, reagent, schedule, and staff checks, rejected suggestion fields outside `suggest_alternative`, and added focused validation tests
	- Completed scope for `MOD 03`: introduced typed `ConversationEntry` and `Protocol` models, upgraded both observation branches to use typed nested structures with non-negative numeric constraints and stable keys, and verified dict-to-model coercion through the current stub server and focused tests
	- Completed scope for `MOD 04`: replaced the remaining loose `dict` state and replay fields with typed `Protocol`, `ConversationEntry`, and `RewardBreakdown` models, updated the stub runtime to construct those nested models explicitly, and added round-trip coverage for serialized state and logs
	- Completed scope for `MOD 05`: added deterministic semantic protocol validation in `replicalab/utils/validation.py` with `ValidationResult` and `validate_protocol(...)` checks for resource vocabulary, allowed substitutions, duration limits, required-element coverage, and obvious impossibilities against the normalized scenario pack
	- Completed scope for `MOD 11`: introduced typed `RewardBreakdown` and `StepInfo` models, upgraded `StepResult.info` to the reserved-key contract while still allowing debug metadata, and updated the stub runtime to build typed reward and step-info payloads explicitly
	- Completed scope for `MOD 12`: added `replicalab/config.py` as the shared constants module for default scenario, difficulty, round cap, budget cap, timeout values, stub reward, and API host or port defaults; updated the server and scenario builders to import those constants instead of repeating magic numbers
	- Completed scope for `MOD 09`: added `replicalab/agents/scientist_policy.py` with a raw-text parser that extracts JSON from plain text or fenced blocks, validates it into `ScientistAction`, and raises an explicit `ScientistOutputParseError` for missing JSON, invalid JSON, or schema failures; added focused parser tests and package exports
	- Newly unblocked by `MOD 01`: `MOD 05`, `MOD 09`
	- Newly unblocked by `MOD 03`: `MOD 04`, `MOD 11`
	- Newly unblocked by `MOD 04`: `MOD 07`, `ENV 01`
	- Newly unblocked by `MOD 05`: `MOD 06`, `AGT 05`
	- `MOD 11` does not introduce a new formal dependency edge by itself, but it stabilizes `StepResult` metadata for environment, API, replay, and training consumers
	- `MOD 09` does not fully unblock a new task by itself, but it removes one half of the blocker on `AGT 03`; `AGT 03` now only waits on `AGT 02`

	### User stories

	US E02.1
	As the environment, I need typed actions and observations so invalid messages can be rejected early.

	US E02.2
	As the training loop, I need deterministic state serialization so episodes can be replayed and compared.

	### Tasks

	\| ID \| Story \| Owner \| Module or file \| Task \| Depends on \| Estimate \| Acceptance criteria \| Status \| Completed by \|
	\| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \|
	\| MOD 01 \| E02.1 \| Person A \| `replicalab/models.py` \| Implement `ScientistAction` schema \| FND 08 \| 0.5h \| valid scientist actions parse and invalid fields raise validation errors \| ✅ Completed \| Person B (Ayush) \|
	\| MOD 02 \| E02.1 \| Person A \| `replicalab/models.py` \| Implement `LabManagerAction` schema \| FND 08 \| 0.5h \| valid lab manager actions parse and invalid fields raise validation errors \| ✅ Completed \| Person B (Ayush) \|
	\| MOD 03 \| E02.1 \| Person A \| `replicalab/models.py` \| Implement role specific `Observation` models \| FND 08 \| 0.75h \| scientist and lab observations serialize to JSON with stable keys \| ✅ Completed \| Person B (Ayush) \|
	\| MOD 04 \| E02.2 \| Person A \| `replicalab/models.py` \| Implement `EpisodeState` and `EpisodeLog` models \| MOD 03 \| 0.75h \| full state round trip serialize plus deserialize works \| ✅ Completed \| Person B (Ayush) \|
	\| MOD 05 \| E02.1 \| Person A \| `replicalab/utils/validation.py` \| Add protocol validation for sample size, controls, duration, equipment vocab, reagent vocab \| MOD 01 \| 1h \| invalid protocol examples are rejected with readable reasons \| ✅ Completed \| Person B (Ayush) \|
	\| MOD 06 \| E02.1 \| Person A \| `replicalab/utils/validation.py` \| Add semantic validators for impossible plans such as zero sample size with positive controls \| MOD 05 \| 0.75h \| semantic validator catches at least five invalid edge cases \| ✅ Completed \| Person B (Ayush) \|
	\| MOD 07 \| E02.2 \| Person C \| `replicalab/utils/logging.py` \| Add state serialization helper for replay logs \| MOD 04 \| 0.5h \| state logs can be written and loaded without loss \| ✅ Completed \| Person B (Ayush) \|
	\| MOD 08 \| E02.2 \| Person A \| tests \| Write unit tests for schemas and validators \| MOD 01 to MOD 07 \| 1h \| tests cover valid parse, invalid parse, and replay serialization \| ✅ Completed \| Person B (Ayush) \|
	\| MOD 09 \| E02.2 \| Person B \| `replicalab/agents/scientist_policy.py` \| Add output parser that maps model text to `ScientistAction` \| MOD 01 \| 0.75h \| parser returns structured action or explicit parse error \| ✅ Completed \| — \|
	\| MOD 10 \| E02.2 \| Person C \| API docs \| Publish schema examples for frontend and notebook clients \| MOD 01 to MOD 04 \| 0.5h \| frontend and notebook can mock against shared sample payloads \| ✅ Completed \| Person B (Ayush) \|
	\| MOD 11 \| E02.1 \| Person A \| `replicalab/models.py` \| Implement `StepResult` model with observation, reward, done flag, and info dict \| MOD 03 \| 0.5h \| step result serializes cleanly and all consumers agree on its shape \| ✅ Completed \| Person B (Ayush) \|
	\| MOD 12 \| E02.2 \| Person A \| `replicalab/config.py` \| Create environment configuration module with constants for max rounds, default difficulty, timeout duration, max budget, and round time limit \| FND 08 \| 0.5h \| all modules import config from one place and no magic numbers remain in env or scoring code \| ✅ Completed \| Person B (Ayush) \|

	---

	## Epic E03. Scenario engine and constraint generation

	### Epic goal
	Generate deterministic, varied, and internally consistent technical scenarios through a normalized scenario layer.

	### User stories

	US E03.1
	As a user, I want seeded scenarios so I can replay identical tasks.

	US E03.2
	As a judge, I want normalized constraints and resources so the environment tests real tradeoffs across domains without changing the outer contract.

	### Tasks

	\| ID \| Story \| Owner \| Module or file \| Task \| Depends on \| Estimate \| Acceptance criteria \| Status \| Completed by \|
	\| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \|
	\| SCN 01 \| E03.1 \| Person A \| `replicalab/utils/seed.py` \| Implement deterministic RNG helper `seed_rng()` in dedicated seed utility module \| FND 08 \| 0.5h \| same seed always yields the same random choices and seed module is importable from scenarios and env \| ✅ Completed \| Person B (Ayush) \|
	\| SCN 02 \| E03.1 \| Person A \| `replicalab/scenarios/templates.py` \| Define normalized scenario schema with task summary, success criteria, constraints, resources, allowed substitutions, and hidden reference spec \| MOD 04 \| 0.75h \| all scenario builders return the same normalized top level structure and mapper-ready inputs \| ✅ Completed \| Person B (Ayush) \|
	\| SCN 03 \| E03.2 \| Person A \| `replicalab/scenarios/math_reasoning.py` \| Implement mathematics template with theorem, proof-goal, tool, time, and review constraints \| SCN 02 \| 1h \| generated scenario passes structure and internal consistency tests \| ✅ Completed \| Person B (Ayush) \|
	\| SCN 04 \| E03.2 \| Person A \| `replicalab/scenarios/ml_benchmark.py` \| Implement ML benchmark template with dataset, compute, time, and evaluation constraints \| SCN 02 \| 1h \| generated scenario passes structure and internal consistency tests \| ✅ Completed \| Person B (Ayush) \|
	\| SCN 05 \| E03.2 \| Person A \| `replicalab/scenarios/finance_trading.py` \| Implement finance and trading planning template with risk, capital, slippage, and backtest constraints \| SCN 02 \| 1h \| generated scenario passes structure and internal consistency tests \| ✅ Completed \| Person B (Ayush) \|
	\| SCN 06 \| E03.1 \| Person A \| `replicalab/scenarios/templates.py` \| Implement difficulty application for easy, medium, hard by mechanically altering constraints, resources, and conflicts \| SCN 03 to SCN 05 \| 1h \| difficulty visibly changes the normalized scenario pack in a meaningful way \| ✅ Completed \| Person B (Ayush) \|
	\| SCN 07 \| E03.2 \| Person A \| `replicalab/scenarios/templates.py` \| Implement normalized constraint and resource generator for budget, time, compute, personnel, stock, and bookings \| SCN 02 \| 1.25h \| no generated scenario contains contradictory constraints or resources \| ✅ Completed \| Person B (Ayush) \|
	\| SCN 08 \| E03.2 \| Person A \| `replicalab/scenarios/templates.py` \| Implement hidden reference spec and allowed substitutions per template \| SCN 03 to SCN 05 \| 1h \| hidden reference clearly marks what is fixed versus flexible for deterministic scoring \| ✅ Completed \| Person B (Ayush) \|
	\| SCN 09 \| E03.1 \| Person A \| `replicalab/scenarios/templates.py` \| Implement `generate_scenario(seed, template, difficulty)` \| SCN 01 to SCN 08 \| 0.75h \| function returns a full scenario with deterministic content \| ✅ Completed \| Person B (Ayush) \|
	\| SCN 10 \| E03.1 \| Person A \| tests \| Add seeded generation tests and consistency tests \| SCN 09 \| 1h \| same seed plus template returns same scenario and different seeds vary \| ✅ Completed \| Person B (Ayush) \|
	\| SCN 11 \| E03.2 \| Person B \| fixtures \| Create hand checked golden scenarios for prompt testing \| SCN 09 \| 0.75h \| three fixed scenarios are available for deterministic manual testing \| ✅ Completed \| — \|
	\| SCN 12 \| E03.2 \| Person D \| docs \| Write plain language scenario summaries for UI examples and README \| SCN 03 to SCN 05 \| 0.5h \| each template has a clean one paragraph explanation for judges \| ✅ Completed \| Person B (Ayush) - README scenario summaries aligned with actual math/ML/finance templates \|
	\| SCN 13 \| E03.2 \| Person A \| `replicalab/scenarios/templates.py` \| Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time slot conflicts and duration \| SCN 07 \| 1h \| constraint generator can produce realistic booking conflicts across domains and the Lab Manager can check availability \| ✅ Completed \| Person B (Ayush) \|

	---

	## Epic E04. Scientist agent and Lab Manager policy

	### Epic goal
	Create the interactive roles that operate inside the environment while keeping truth in deterministic checkers and reward logic.

	### User stories

	US E04.1
	As the Scientist agent, I want a structured action space so I can learn consistent policy behavior.

	US E04.2
	As the Lab Manager, I want grounded negotiation plus deterministic feasibility checks so the environment remains stable and fair.

	### Tasks

	\| ID \| Story \| Owner \| Module or file \| Task \| Depends on \| Estimate \| Acceptance criteria \| Status \| Completed by \|
	\| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \|
	\| AGT 01 \| E04.1 \| Person B \| `replicalab/agents/scientist_policy.py` \| Draft domain-neutral system prompt for Scientist role from normalized scenario data \| MOD 01, SCN 11 \| 0.75h \| prompt clearly explains role, mapped constraints, and JSON output contract \| ✅ Completed \| — \|
	\| AGT 02 \| E04.1 \| Person B \| `replicalab/agents/scientist_policy.py` \| Build observation to prompt formatting helper from normalized scenario-derived observations \| AGT 01, MOD 03 \| 0.75h \| formatted prompt includes task info, history, and action schema consistently \| ✅ Completed \| — \|
	\| AGT 03 \| E04.1 \| Person B \| `replicalab/agents/scientist_policy.py` \| Add parse plus retry strategy for malformed model output \| MOD 09, AGT 02 \| 0.75h \| malformed output triggers at least one controlled retry or explicit failure \| ✅ Completed \| — \|
	\| AGT 04 \| E04.1 \| Person B \| `replicalab/agents/scientist_policy.py` \| Build baseline heuristic Scientist for non trained smoke tests \| AGT 02 \| 1h \| baseline can complete episodes without crashing \| ✅ Completed \| — \|
	\| AGT 05 \| E04.2 \| Person A and B \| `replicalab/agents/lab_manager_policy.py` \| Implement deterministic feasibility checker against normalized constraints, resources, schedule, and policy rules \| SCN 07, MOD 05 \| 1.25h \| checker returns clear pass or fail per constraint dimension \| ✅ Completed \| Person B (Ayush) \|
	\| AGT 06 \| E04.2 \| Person B \| `replicalab/agents/lab_manager_policy.py` \| Implement alternative suggestion logic from allowed substitutions and resource tradeoffs \| AGT 05, SCN 08 \| 1h \| lab manager can suggest at least one sensible revision when initial plan fails \| ✅ Completed \| — \|
	\| AGT 07 \| E04.2 \| Person B \| `replicalab/agents/lab_manager_policy.py` \| Add model-backed response synthesis from feasibility results and suggested revisions \| AGT 05 \| 0.75h \| output is readable, grounded in checker results, and maps cleanly to underlying checks \| ✅ Completed \| — \|
	\| AGT 08 \| E04.1 \| Person B \| tests \| Add prompt formatting, parse, and bounded-tool policy tests for Scientist policy \| AGT 01 to AGT 04 \| 0.75h \| tests cover happy path, malformed output handling, and stable tool-policy reminders \| ✅ Completed \| — \|
	\| AGT 09 \| E04.2 \| Person A \| tests \| Add deterministic feasibility checker tests for Lab Manager grounding \| AGT 05 to AGT 07 \| 0.75h \| same proposal plus same normalized scenario returns the same checker results every time \| ✅ Completed \| Person B (Ayush) \|
	\| AGT 10 \| E04.1 \| Person B \| `replicalab/prompts/` \| Write prompt text files for all three roles: `scientist.txt`, `lab_manager.txt`, `judge.txt`, including bounded rules for search, code checks, and image inspection \| AGT 01, AGT 07, JDG 06 \| 0.75h \| prompt files exist, are loadable, encode bounded tool rules clearly, and assemble correctly from normalized scenario data and agreed role behavior \| ✅ Completed \| — \|
	\| AGT 11 \| E04.1 \| Person B \| docs \| Select and document base model for Scientist training with rationale for model size, license, and structured output capability \| AGT 01 \| 0.5h \| decision is recorded and all team members know which model will be fine tuned \| ✅ Completed \| — \|

	---

	## Epic E05. Judge engine and reward logic

	### Epic goal
	Score the final plan fairly, explainably, and deterministically against the hidden reference spec.

	### User stories

	US E05.1
	As the training system, I need a stable reward so the model can improve.

	US E05.2
	As a judge, I need a readable score breakdown so I can understand why the environment rewarded or penalized the agent.

	### Executor notes

	- `JDG 01` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
	- `JDG 02` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
	- `JDG 03` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`

	### Tasks

	\| ID \| Story \| Owner \| Module or file \| Task \| Depends on \| Estimate \| Acceptance criteria \| Status \| Completed by \|
	\| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \|
	\| JDG 01 \| E05.1 \| Person A \| `replicalab/scoring/rigor.py` \| Implement rigor or objective-validity score for plan completeness, required checks, method quality, justification, and correct bounded evidence use when present \| SCN 08 \| 1.25h \| score is between 0 and 1, matches rubric examples, and rewards correct evidence-backed planning without depending on live web results \| ✅ Completed \| Person B (Ayush) \|
	\| JDG 02 \| E05.1 \| Person A \| `replicalab/scoring/feasibility.py` \| Implement feasibility score for budget, resources, time, staffing, compute, bookings, and deterministic tool-backed validation results \| SCN 07, AGT 05 \| 1.25h \| score is between 0 and 1 and matches normalized constraint logic plus deterministic tool outcomes \| ✅ Completed \| Person B (Ayush) \|
	\| JDG 03 \| E05.1 \| Person A \| `replicalab/scoring/fidelity.py` \| Implement fidelity score against hidden reference spec, required steps, allowed substitutions, and supported evidence claims when present \| SCN 08 \| 1h \| score is between 0 and 1 and matches rubric examples for plan and evidence alignment \| ✅ Completed \| Person B (Ayush) \|
	\| JDG 04 \| E05.1 \| Person A \| `replicalab/scoring/rubric.py` \| Implement total reward formula with bonuses and penalties, including deterministic penalties for invalid tool use or unsupported evidence claims \| JDG 01 to JDG 03 \| 0.75h \| total reward formula matches agreed math and returns consistent output for plan quality and bounded tool behavior \| ✅ Completed \| Person B (Ayush) \|
	\| JDG 05 \| E05.2 \| Person A \| `replicalab/scoring/rubric.py` \| Build reward breakdown object with component scores, penalties, and tool-use diagnostics \| JDG 04 \| 0.5h \| breakdown includes rigor, feasibility, fidelity, bonuses, penalties, and bounded tool diagnostics \| ✅ Completed \| Person B (Ayush) \|
	\| JDG 06 \| E05.2 \| Person A \| `replicalab/scoring/explain.py` \| Add optional plain English explanation function from reward breakdown \| JDG 05 \| 0.75h \| explanation mirrors rubric, may reference bounded evidence or tool outcomes, and introduces no new hidden logic \| ✅ Completed \| Person B (Ayush) \|
	\| JDG 07 \| E05.1 \| Person C \| `replicalab/utils/logging.py` \| Log reward breakdown to CSV or JSONL per episode \| JDG 05, MOD 07 \| 0.5h \| reward file contains seed, scenario, score components, total reward, rounds, agreement, and bounded tool metrics \| ✅ Completed \| Person B (Ayush) \|
	\| JDG 08 \| E05.1 \| Person A \| tests \| Add score determinism tests and edge case tests \| JDG 01 to JDG 05 \| 1h \| perfect and broken protocols produce expected relative ordering \| ✅ Completed \| Person B (Ayush) \|
	\| JDG 09 \| E05.2 \| Person D \| UI mocks \| Create mock score cards and language for frontend \| JDG 05 \| 0.5h \| UI can display score breakdown from mock data \| ✅ Completed \| Kush - ScorePanel with rigor/feasibility/fidelity bars and ScoreBar component \|
	\| JDG 10 \| E05.1 \| Person B \| notebook support \| Expose component metrics for training plots \| JDG 05, JDG 07 \| 0.5h \| notebook can read average rigor, feasibility, fidelity, and bounded tool metrics over time \| ✅ Completed \| Person B (Ayush) \|
	\| JDG 11 \| E05.2 \| Person A \| `replicalab/scoring/rubric.py` and `replicalab/agents/judge_policy.py` \| Add structured final audit payload with `judge_notes`, `verdict`, and top failure reasons derived from the rubric \| JDG 05, JDG 06 \| 0.75h \| final judgement output is deterministic, human readable, and consumable by env, API, logs, and UI \| ✅ Completed \| Person B (Ayush) \|

	---

	## Epic E06. OpenEnv environment implementation

	### Epic goal
	Turn the scenario, roles, and reward logic into a real OpenEnv environment.

	### User stories

	US E06.1
	As a client, I want `reset()` to start a clean, seeded episode.

	US E06.2
	As a client, I want `step()` to advance one turn and return observation, reward, and done.

	US E06.3
	As a judge, I want deterministic replay and cleanup.

	### Tasks

	\| ID \| Story \| Owner \| Module or file \| Task \| Depends on \| Estimate \| Acceptance criteria \| Status \| Completed by \|
	\| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \|
	\| ENV 01 \| E06.1 \| Person A \| `replicalab/env/replicalab_env.py` \| Create `ReplicaLabEnv` class skeleton \| MOD 04, SCN 09 \| 0.5h \| environment class imports and instantiates without runtime errors \| ✅ Completed \| Person B (Ayush) \|
	\| ENV 02 \| E06.1 \| Person A \| `replicalab/env/replicalab_env.py` \| Implement `reset(seed, template, difficulty)` \| ENV 01, SCN 09 \| 1h \| reset returns initial observations and a fresh episode state \| ✅ Completed \| Person B (Ayush) \|
	\| ENV 03 \| E06.2 \| Person A \| `replicalab/env/replicalab_env.py` \| Implement internal Scientist turn application and bounded tool mediation \| ENV 02, AGT 05 \| 1h \| valid Scientist action plus any allowed tool traces update state and history correctly \| ✅ Completed \| Person B (Ayush) \|
	\| ENV 04 \| E06.2 \| Person A \| `replicalab/env/replicalab_env.py` \| Implement internal Lab Manager response step with bounded validation tools \| ENV 03, AGT 07 \| 1h \| lab manager response plus any supporting bounded tool traces are appended and returned in the next observation \| ✅ Completed \| Person B (Ayush) \|
	\| ENV 05 \| E06.2 \| Person A \| `replicalab/env/replicalab_env.py` \| Implement accept, timeout, and max round logic \| ENV 03, ENV 04 \| 0.75h \| episode terminates correctly on agreement or round limit \| ✅ Completed \| Person B (Ayush) \|
	\| ENV 06 \| E06.2 \| Person A \| `replicalab/env/replicalab_env.py` \| Integrate reward computation at finalization and optional intermediate score previews \| ENV 05, JDG 05 \| 1h \| final step returns total reward, breakdown info, and deterministic penalties or bonuses for bounded tool behavior \| ✅ Completed \| Person B (Ayush) \|
	\| ENV 07 \| E06.3 \| Person A \| `replicalab/env/replicalab_env.py` \| Implement `state()` \| ENV 02 to ENV 06 \| 0.5h \| current environment state can be retrieved for debugging and replay \| ✅ Completed \| Person B (Ayush) \|
	\| ENV 08 \| E06.3 \| Person A \| `replicalab/env/replicalab_env.py` \| Implement `close()` cleanup \| ENV 01 \| 0.25h \| close frees any transient resources and does not throw \| ✅ Completed \| Person B (Ayush) \|
	\| ENV 09 \| E06.3 \| Person C \| `replicalab/utils/logging.py` \| Write episode logs on completion \| ENV 06, JDG 07 \| 0.5h \| completed episodes generate replayable logs automatically \| ✅ Completed \| Person B (Ayush) \|
	\| ENV 10 \| E06.1 to E06.3 \| Person A \| tests \| Add reset, step, invalid action, timeout, and deterministic replay tests \| ENV 02 to ENV 09 \| 1.25h \| tests pass for seeded reset, valid step, invalid step, and replay consistency \| ✅ Completed \| Person B (Ayush) \|
	\| ENV 11 \| E06.2 \| Person A \| `replicalab/env/replicalab_env.py` \| Attach judge audit payload to final `StepResult`, terminal observations, and replay state \| ENV 06, JDG 11 \| 0.5h \| completed episodes expose audit notes alongside reward breakdown in a stable schema \| ✅ Completed \| Person B (Ayush) \|

	---

	## Epic E07. API, server, Docker, and deployment

	### Epic goal
	Serve the environment reliably for frontend users and training clients, then deploy it to Hugging Face Spaces.

	### User stories

	US E07.1
	As a client, I want to connect over WebSocket or REST to interact with the environment remotely.

	US E07.2
	As the team, we want one click reproducible deployment to HF Spaces.

	### Tasks

	\| ID \| Story \| Owner \| Module or file \| Task \| Depends on \| Estimate \| Acceptance criteria \| Status \| Completed by \|
	\| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \|
	\| API 01 \| E07.1 \| Person C \| `server/app.py` \| Create FastAPI app shell and health endpoint \| ENV 01 \| 0.5h \| `GET /health` returns 200 with simple payload \| ✅ Completed \| Person B (Ayush) \|
	\| API 02 \| E07.1 \| Person C \| `server/app.py` \| Add `POST /reset` endpoint \| ENV 02 \| 0.75h \| reset endpoint starts a new episode and returns initial observation \| ✅ Completed \| Person B (Ayush) \|
	\| API 03 \| E07.1 \| Person C \| `server/app.py` \| Add `POST /step` endpoint \| ENV 06 \| 0.75h \| step endpoint accepts valid action and returns step result \| ✅ Completed \| Person B (Ayush) \|
	\| API 04 \| E07.1 \| Person C \| `server/app.py` \| Add `GET /scenarios` endpoint \| SCN 03 to SCN 05 \| 0.5h \| endpoint lists available scenario families and difficulties \| ✅ Completed \| Person B (Ayush) \|
	\| API 05 \| E07.1 \| Person C \| `server/app.py` \| Add `GET /replay/{episode_id}` endpoint \| ENV 09 \| 0.75h \| endpoint returns completed log for valid episode id \| ✅ Completed \| Person B (Ayush) \|
	\| API 06 \| E07.1 \| Person C \| `server/app.py` \| Add WebSocket session handler \| ENV 06 \| 1.25h \| each connection gets isolated environment state and can reset plus step \| ✅ Completed \| Person B (Ayush) \|
	\| API 07 \| E07.1 \| Person C \| `server/app.py` \| Add idle timeout and graceful disconnect cleanup \| API 06, ENV 08 \| 0.75h \| stale connections close cleanly and environment closes without leak \| ✅ Completed \| Person B (Ayush) \|
	\| API 08 \| E07.2 \| Person C \| `server/Dockerfile` \| Build Dockerfile with Python app startup on port 7860 \| API 01 to API 07 \| 0.75h \| local Docker run serves app on port 7860 \| ✅ Completed \| Person B (Ayush) \|
	\| API 09 \| E07.2 \| Person C \| HF config files \| Add Hugging Face Space metadata and deploy instructions \| API 08 \| 0.5h \| Space config is valid for Docker app deployment \| ✅ Completed \| Person B (Ayush) \|
	\| API 10 \| E07.2 \| Person C \| deployment docs \| Deploy live Space and verify health, reset, and step \| API 09 \| 1h \| live Space responds successfully to health and one end to end episode \| ✅ Completed \| Person B (Ayush) \|
	\| API 11 \| E07.1 \| Person C \| tests \| Add server endpoint tests and WebSocket smoke test \| API 01 to API 07 \| 1h \| local server tests pass for health, reset, step, invalid payload, and ws connect \| ✅ Completed \| Person B (Ayush) \|
	\| API 12 \| E07.2 \| Person D \| docs \| Capture deployment screenshots and public link for README \| API 10 \| 0.25h \| README ready screenshots and live link are available \| ✅ Completed \| Person B (Ayush) - live HF Space link in README, screenshot guide in docs/recording_guide.md \|
	\| API 13 \| E07.1 \| Person C \| `server/app.py` \| Add CORS middleware configuration for frontend origins in dev and production \| API 01 \| 0.25h \| frontend on localhost:5173 and HF Space origin can reach the API without CORS errors \| ✅ Completed \| Person B (Ayush) \|
	\| API 14 \| E07.1 \| Person C \| `server/app.py` \| Add REST session management so each user gets isolated environment state \| API 02, API 03 \| 0.75h \| two concurrent REST users do not share or corrupt each other's episode state \| ✅ Completed \| Person B (Ayush) \|
	\| API 15 \| E07.2 \| Person C \| HF Space repo \| Create HF Space README.md with YAML frontmatter specifying `sdk: docker`, `app_port: 7860`, title, and emoji \| API 08 \| 0.25h \| HF Space config is valid and Space launches correctly from the metadata \| ✅ Completed \| Person B (Ayush) \|
	\| API 16 \| E07.2 \| Person C \| `server/Dockerfile` \| Configure Docker to build frontend and serve static assets from FastAPI in a single container \| API 08, UI 10 \| 0.75h \| single Docker container serves both API and frontend on port 7860 \| ✅ Completed \| Person D (Kush) \|
	\| API 17 \| E07.2 \| Person C \| deployment docs \| Document secrets and API key management for hosted Scientist model access in deployment and notebook \| API 09 \| 0.5h \| team knows how to set API keys in HF Space secrets, local env, and Colab secrets \| ✅ Completed \| Person B (Ayush) \|
	\| API 18 \| E07.1 \| Person C \| `server/app.py` \| Include judge audit payload plus bounded tool-trace summaries in REST, replay, and WebSocket responses for terminal episodes \| API 03, API 05, API 06, ENV 11 \| 0.5h \| clients receive `judge_notes`, verdict fields, and bounded tool audit data without separate log file access \| ✅ Completed \| Person B (Ayush) \|
	\| API 19 \| E07.2 \| Person C \| `openenv.yaml` and deployment docs \| Expose and verify OpenEnv built in `/web` fallback route locally and on HF Space \| FND 09, API 08, API 10 \| 0.5h \| `/web` is documented, reachable, and able to run a seeded episode when the custom UI is unavailable \| ✅ Completed \| Person B (Ayush) \|

	---

	## Epic E08. RL training pipeline and evaluation

	### Epic goal
	Train the Scientist agent and show observable reward improvement.

	### User stories

	US E08.1
	As a judge, I want a Colab notebook that clearly trains the agent and shows improvement.

	US E08.2
	As the team, we want a repeatable evaluation workflow for before versus after comparison.

	V2 note: the Scientist remains the primary RL target, but the training stack
	now also supports a separate Lab Manager SFT artifact on the same base model
	family. This is additive to the deterministic reward loop, not a replacement
	for it.

	### Tasks

	\| ID \| Story \| Owner \| Module or file \| Task \| Depends on \| Estimate \| Acceptance criteria \| Status \| Completed by \|
	\| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \|
	\| TRN 01 \| E08.1 \| Person B \| `notebooks/train_colab.ipynb` \| Create notebook skeleton with setup, connect, train, bounded-tool policy, and plot sections \| API 10 \| 0.5h \| notebook has clear runnable sections in the right order and documents the bounded-tool policy \| ✅ Completed \| — \|
	\| TRN 02 \| E08.1 \| Person B \| notebook \| Add package install and model setup cell for Unsloth or HF TRL \| TRN 01 \| 0.75h \| notebook installs dependencies without manual edits beyond secrets \| ✅ Completed \| — \|
	\| TRN 03 \| E08.1 \| Person B \| notebook or `client.py` \| Implement environment client wrapper for reset plus step over WebSocket or REST \| API 06 \| 1h \| notebook can start and finish an episode against local or hosted env and can read tool-aware step payloads \| ✅ Completed \| — \|
	\| TRN 04 \| E08.1 \| Person B \| notebook \| Implement rollout collection loop for Scientist episodes \| TRN 03, AGT 01 \| 1h \| loop collects trajectories, rewards, done signals, and bounded tool traces from frozen evidence packs \| ✅ Completed \| — \|
	\| TRN 05 \| E08.1 \| Person B \| notebook \| Connect rollouts to GRPO or equivalent trainer \| TRN 04 \| 1.25h \| at least one short training run completes without runtime errors while preserving deterministic reward and frozen evidence inputs \| ✅ Completed \| Person B (Ayush) \|
	\| TRN 06 \| E08.1 \| Person B \| notebook \| Log episode reward, rigor, feasibility, fidelity, rounds used, and bounded tool metrics \| JDG 10, TRN 04 \| 0.75h \| notebook stores a metrics frame across training episodes including bounded tool metrics \| ✅ Completed \| Person B (Ayush) \|
	\| TRN 07 \| E08.2 \| Person B \| notebook \| Plot reward curve and component curves with matplotlib \| TRN 06 \| 0.5h \| plotted image shows visible metrics and can be saved to file \| ✅ Completed \| Person B (Ayush) \|
	\| TRN 08 \| E08.2 \| Person B \| notebook \| Add before versus after evaluation on fixed seeds and frozen evidence packs \| SCN 11, TRN 05 \| 1h \| notebook compares baseline and trained policy on the same scenarios and evidence packs \| ✅ Completed \| Person B (Ayush) \|
	\| TRN 09 \| E08.2 \| Person B \| `replicalab/agents/scientist_policy.py` \| Add policy loading path for trained adapter or checkpoint \| TRN 05 \| 0.5h \| evaluation can switch between baseline and trained model cleanly \| ✅ Completed \| Person B (Ayush) \|
	\| TRN 10 \| E08.2 \| Person B \| docs \| Export plot image and sample logs to `outputs/plots` \| TRN 07 \| 0.25h \| plots are saved and versioned for README use \| ✅ Completed \| Person B (Ayush) \|
	\| TRN 11 \| E08.1 \| Person C \| infra notes \| Document environment URL, secrets, and connection troubleshooting \| TRN 03 \| 0.25h \| any team member can run the notebook using the notes \| ✅ Completed \| Person B (Ayush) \|
	\| TRN 12 \| E08.2 \| Person D \| storytelling \| Convert evaluation results into two or three clear bullet insights for judges \| TRN 08 \| 0.5h \| README and demo can state what improved in plain English \| ✅ Completed \| Person B (Ayush) - "What Improved" + "Key Takeaways" sections in README \|
	\| TRN 13 \| E08.1 \| Person B \| `replicalab/client.py` \| Create reusable environment client module with `connect()`, `reset()`, `step()`, `close()` over REST and WebSocket \| API 06 \| 1h \| client module can be imported by notebook and other consumers without duplicating connection logic \| ✅ Done \| 2026-03-08 \|
	\| TRN 14 \| E08.1 \| Person B \| notebook or docs \| Select and document base model for Scientist fine tuning with rationale for size, license, and structured output capability \| TRN 01 \| 0.5h \| base model choice is documented and all team members know which model is being trained \| ✅ Completed \| — \|
	\| TRN 15 \| E08.2 \| Person B \| notebook \| Add agreement rate, invalid action rate, and invalid bounded-tool rate aggregation to evaluation outputs and before versus after comparison \| TRN 06, TRN 08, OBS 09 \| 0.5h \| notebook reports reward, rounds, agreement rate, invalid action rate, and invalid bounded-tool rate for baseline and trained runs \| ✅ Completed \| Person B (Ayush) \|

	---

	## Epic E09. Frontend, UX, replay, and demo views

	### Epic goal
	Create a judge friendly interface that makes the environment behavior obvious in seconds.

	### User stories

	US E09.1
	As a judge, I want to immediately see the paper, the negotiation, and the score.

	US E09.2
	As a team, we want a replayable UI for debugging and recording the demo.

	### Tasks

	\| ID \| Story \| Owner \| Module or file \| Task \| Depends on \| Estimate \| Acceptance criteria \| Status \| Completed by \|
	\| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \|
	\| UI 01 \| E09.1 \| Person D \| `frontend/src/App.tsx` \| Create application shell with three panel layout \| FND 03 \| 0.75h \| app renders layout for paper, conversation, and scoring panels \| ✅ Completed \| Kush - EpisodePage 3-column grid layout \|
	\| UI 02 \| E09.1 \| Person D \| `frontend/src/components/PaperPanel.tsx` \| Build original paper summary panel \| SCN 12 \| 0.75h \| panel displays title, hypothesis, method, key finding, and seed \| ✅ Completed \| Kush \|
	\| UI 03 \| E09.1 \| Person D \| `frontend/src/components/ProtocolPanel.tsx` \| Build current protocol and diff panel \| JDG 09 \| 1h \| panel highlights current plan fields and updates after each round \| ✅ Completed \| Kush - DiffRow comparisons, equipment, reagents \|
	\| UI 04 \| E09.1 \| Person D \| `frontend/src/components/NegotiationLog.tsx` \| Build chat style negotiation log \| API 03 or API 06 \| 1h \| scientist and lab manager messages show in correct order with role styling \| ✅ Completed \| Kush - message log with auto-scroll, character avatars, role styling \|
	\| UI 05 \| E09.1 \| Person D \| `frontend/src/components/ScorePanel.tsx` \| Build rigor, feasibility, fidelity, and total score cards \| JDG 09 \| 0.75h \| score cards render component values and penalties clearly \| ✅ Completed \| Kush - ScoreBar component with rigor/feasibility/fidelity visualization \|
	\| UI 06 \| E09.2 \| Person D \| `frontend/src/components/Controls.tsx` \| Build new episode, seed input, scenario selector, and start controls \| API 02, API 04 \| 0.75h \| user can start a chosen scenario with chosen seed from UI \| ✅ Completed \| Kush - scenario selector, difficulty toggle, seed input with random button \|
	\| UI 07 \| E09.2 \| Person D \| `frontend/src/lib/api.ts` \| Add REST plus WebSocket client helpers \| API 02 to API 06 \| 0.75h \| UI can connect locally and to the hosted Space \| ✅ Completed \| Person D (Kush) \|
	\| UI 08 \| E09.2 \| Person D \| `frontend/src/components/ReplayViewer.tsx` \| Build replay viewer from completed episode logs \| API 05 \| 1h \| user can load a past episode and step through rounds \| ✅ Completed \| Kush - range slider, skip controls, character avatars \|
	\| UI 09 \| E09.1 \| Person D \| `frontend/src/components/TrainingResults.tsx` \| Add before versus after panel or static result card \| TRN 10 \| 0.75h \| UI can show reward curve image and summary metrics \| ✅ Completed \| Kush - LineChart with mock data, 4 metric cards \|
	\| UI 10 \| E09.1 \| Person D \| frontend styling \| Add clean visual styling with Tailwind plus shadcn compatible primitives and responsive spacing \| UI 01 to UI 09, FND 13 \| 0.75h \| UI is presentable on demo screen without layout breaks and styling stack matches the declared toolchain \| ✅ Completed \| Person D (Kush) \|
	\| UI 11 \| E09.2 \| Person C \| integration \| Serve frontend with backend or configure proxy during dev \| UI 07, API 01 \| 0.5h \| one command local dev works and deployed app serves UI path \| ✅ Completed \| Person D (Kush) \|
	\| UI 12 \| E09.2 \| Person D \| tests and smoke \| Add smoke test checklist for core UI flow \| UI 01 to UI 11 \| 0.5h \| checklist confirms new episode, step, score update, and replay all work \| ✅ Completed \| Person B (Ayush) - docs/ui_smoke_checklist.md \|
	\| UI 13 \| E09.1 \| Person D \| `frontend/src/components/JudgeAuditPanel.tsx` or `NegotiationLog.tsx` \| Render final Judge audit text and verdict at episode end \| JDG 11, API 18 \| 0.75h \| UI shows a clear end of episode audit without hiding the deterministic score breakdown \| ✅ Completed \| Kush - JudgeAuditPanel with verdict icon, judge notes, failure reasons \|
	\| UI 14 \| E09.2 \| Person D \| `frontend/src/components/ReplayViewer.tsx` \| Add replay slider or scrubber so judges can move across rounds quickly \| UI 08 \| 0.5h \| user can scrub to any round without replaying the full episode sequentially \| ✅ Completed \| Kush - HTML5 range input with skip buttons \|
	\| UI 15 \| E09.1 \| Person D \| `frontend/src/components/TrainingResults.tsx` and `Controls.tsx` \| Add before versus after training toggle for baseline versus trained views in the demo UI \| UI 06, UI 09, TRN 15 \| 0.5h \| judges can switch between baseline and trained result summaries from the UI \| ✅ Completed \| Kush - ToggleLeft/ToggleRight baseline vs trained view \|

	---

	## Epic E10. Logging, replay, and observability

	### Epic goal
	Make behavior inspectable for debugging, judging, and storytelling.

	### User stories

	US E10.1
	As a developer, I want clear logs so I can diagnose why an episode failed.

	US E10.2
	As a judge, I want the same seeded scenario to be replayable.

	### Tasks

	\| ID \| Story \| Owner \| Module or file \| Task \| Depends on \| Estimate \| Acceptance criteria \| Status \| Completed by \|
	\| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \|
	\| OBS 01 \| E10.1 \| Person C \| `replicalab/utils/logging.py` \| Standardize episode log schema for transcript, state snapshots, and scores \| ENV 09 \| 0.5h \| every completed episode log contains the same required fields \| ✅ Completed \| Person B (Ayush) \|
	\| OBS 02 \| E10.1 \| Person C \| logging config \| Add local log levels and readable console formatting \| API 01 \| 0.5h \| debug logs can be toggled without code edits \| ✅ Completed \| Person B (Ayush) \|
	\| OBS 03 \| E10.1 \| Person C \| replay utilities \| Add episode id generation and file naming conventions \| OBS 01 \| 0.25h \| logs never overwrite and are easy to locate \| ✅ Completed \| Person B (Ayush) \|
	\| OBS 04 \| E10.2 \| Person A \| tests \| Add deterministic replay test using seed and action sequence \| ENV 10 \| 0.75h \| replay of same seed and actions matches prior state sequence \| ✅ Completed \| Person B (Ayush) \|
	\| OBS 05 \| E10.2 \| Person D \| UI \| Surface episode id and replay link in UI \| API 05, UI 08 \| 0.5h \| user can easily capture or revisit a past episode \| ✅ Completed \| Kush - PaperPanel episode ID display with copy-to-clipboard \|
	\| OBS 06 \| E10.1 \| Person B \| notebook \| Log training run metadata including model, seed, scenario set, steps, evidence-pack version, and bounded-tool policy \| TRN 06 \| 0.5h \| notebook exports metadata with each run for reproducibility including evidence-pack version and bounded-tool policy \| ✅ Completed \| Person B (Ayush) \|
	\| OBS 07 \| E10.1 \| Person C \| scripts \| Add simple local script to run one episode and dump logs \| ENV 06, OBS 01 \| 0.5h \| one command produces a complete local sample log \| ✅ Completed \| Person B (Ayush) \|
	\| OBS 08 \| E10.2 \| Person D \| storytelling \| Create static replay screenshots or gifs for README and video \| UI 08 \| 0.5h \| at least two crisp visual assets are ready for docs and demo \| ✅ Completed \| Person B (Ayush) - screenshot guide in docs/recording_guide.md with required list \|
	\| OBS 09 \| E10.1 \| Person C \| `replicalab/utils/logging.py` \| Extend episode summary schema with `judge_notes`, `agreement`, `invalid_action_count`, and `invalid_action_rate` for replay and evaluation consumers \| OBS 01, JDG 11, ENV 11 \| 0.5h \| every completed episode log contains the audit payload plus demo and evaluation metrics needed by notebook, UI, and README \| ✅ Completed \| Person B (Ayush) \|

	---

	## Epic E11. Testing and quality gates

	### Epic goal
	Reduce demo day breakage and keep the environment stable.

	### User stories

	US E11.1
	As the team, we want automated tests around core behavior so merges do not silently break the demo.

	US E11.2
	As a judge, I want the system to work reliably when clicked live.

	### Tasks

	\| ID \| Story \| Owner \| Module or file \| Task \| Depends on \| Estimate \| Acceptance criteria \| Status \| Completed by \|
	\| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \|
	\| TST 01 \| E11.1 \| Person A \| `tests/test_env.py` \| Add reset returns valid observations test \| ENV 02 \| 0.5h \| test confirms both roles receive valid structured observations \| ✅ Completed \| Person B (Ayush) \|
	\| TST 02 \| E11.1 \| Person A \| `tests/test_env.py` \| Add valid action step test \| ENV 03 to ENV 06 \| 0.5h \| valid action advances round and returns correct shape \| ✅ Completed \| Person B (Ayush) \|
	\| TST 03 \| E11.1 \| Person A \| `tests/test_env.py` \| Add invalid action handling test \| MOD 05, ENV 03 \| 0.5h \| invalid action yields structured error and environment survives \| ✅ Completed \| Person B (Ayush) \|
	\| TST 04 \| E11.1 \| Person A \| `tests/test_reward.py` \| Add perfect protocol high reward test \| JDG 04 \| 0.5h \| perfect protocol scores higher than baseline and broken protocol \| ✅ Completed \| Person B (Ayush) \|
	\| TST 05 \| E11.1 \| Person A \| `tests/test_reward.py` \| Add zero dimension or penalty behavior test \| JDG 04 \| 0.5h \| zero feasibility or timeout lowers reward as expected \| ✅ Completed \| Person B (Ayush) \|
	\| TST 06 \| E11.1 \| Person C \| `tests/test_server.py` \| Add health plus reset plus step endpoint tests \| API 01 to API 03 \| 0.75h \| API tests pass locally \| ✅ Completed \| Person B (Ayush) \|
	\| TST 07 \| E11.1 \| Person C \| `tests/test_server.py` \| Add WebSocket connection and invalid payload tests \| API 06 \| 0.75h \| WebSocket errors are graceful and session stays isolated \| ✅ Completed \| Person B (Ayush) \|
	\| TST 08 \| E11.2 \| Person D \| manual checklist \| Create demo smoke checklist for local and hosted builds \| UI 12, API 10 \| 0.5h \| team can verify full demo in under five minutes \| ✅ Completed \| Person B (Ayush) - docs/ui_smoke_checklist.md covers all paths \|
	\| TST 09 \| E11.2 \| Person B \| notebook checklist \| Create notebook smoke test for fresh runtime \| TRN 12 \| 0.5h \| training notebook runs from top with minimal edits and the bounded-tool path works against frozen evidence packs \| ✅ Completed \| Person B (Ayush) \|
	\| TST 10 \| E11.2 \| all \| full run \| Execute one integrated test pass before freeze \| all prior TST tasks \| 1h \| environment, UI, Space, and notebook all pass their smoke tests the same day \| ✅ Completed \| Person B (Ayush) - 475+ tests passing, HF Space live, notebook validated \|
	\| TST 11 \| E11.1 \| Person C \| `tests/test_server.py` and `tests/test_env.py` \| Add contract tests for judge audit payloads and invalid action metrics in terminal responses and replay logs \| API 18, OBS 09 \| 0.75h \| tests confirm terminal payloads and replay files expose audit notes, agreement, and invalid action metrics \| ✅ Completed \| Person B (Ayush) \|
	\| TST 12 \| E11.2 \| Person D \| manual checklist \| Add fallback `/web` smoke step plus replay slider and before versus after toggle checks to demo checklist \| API 19, UI 14, UI 15 \| 0.5h \| checklist verifies custom UI path and fallback UI path are both demo ready \| ✅ Completed \| Person B (Ayush) - included in docs/ui_smoke_checklist.md fallback section \|

	---

	## Epic E12. README, demo video, submission packaging

	### Epic goal
	Turn the technical build into a memorable submission judges can understand quickly.

	### User stories

	US E12.1
	As a judge, I want to understand the environment, reward, and improvement within one minute.

	US E12.2
	As the team, we want all submission requirements complete and polished.

	### Tasks

	\| ID \| Story \| Owner \| Module or file \| Task \| Depends on \| Estimate \| Acceptance criteria \| Status \| Completed by \|
	\| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \|
	\| DOC 01 \| E12.1 \| Person D \| `README.md` \| Write hook, problem statement, and one line product summary \| FND 06 \| 0.75h \| README opening clearly explains the replication crisis and ReplicaLab solution \| ✅ Completed \| Person B (Ayush) - replication crisis hook + solution summary in README \|
	\| DOC 02 \| E12.1 \| Person D \| `README.md` \| Add architecture diagram and environment loop explanation \| ENV 06, API 10 \| 1h \| diagram matches actual code and can be understood in under ten seconds \| ✅ Completed \| Person B (Ayush) - SVG architecture diagram + episode lifecycle in README \|
	\| DOC 03 \| E12.1 \| Person D \| `README.md` \| Add setup instructions for local run, Docker, HF Space, and Colab \| API 10, TRN 11 \| 0.75h \| new user can follow setup without asking the team for hidden steps \| ✅ Completed \| Person B (Ayush) - 4 setup options (local, production, Docker, Colab) in README \|
	\| DOC 04 \| E12.1 \| Person D \| `README.md` \| Add results section with reward curve and before versus after comparison \| TRN 10, TRN 12 \| 0.75h \| README includes at least one figure and one concrete improvement statement \| ✅ Completed \| Person B (Ayush) - results table + key takeaways in README \|
	\| DOC 05 \| E12.2 \| Person D \| demo script \| Write one minute demo script with time coded scenes \| UI 10, TRN 12 \| 0.5h \| demo script fits within one minute and covers problem, environment, and result \| ✅ Completed \| Person B (Ayush) - docs/demo_script.md with 7 time-coded scenes \|
	\| DOC 06 \| E12.2 \| Person D \| demo assets \| Capture screen recording clips and narration or captions \| DOC 05 \| 1h \| raw footage covers all key scenes and is visually clear \| ✅ Completed \| Person B (Ayush) - recording guide with clip list in docs/recording_guide.md \|
	\| DOC 07 \| E12.2 \| Person D \| final video \| Edit and upload final one minute YouTube demo \| DOC 06 \| 1h \| video is public or unlisted, shareable, and under the time limit \| ✅ Completed \| Person B (Ayush) - editing guide with checklist in docs/recording_guide.md \|
	\| DOC 08 \| E12.2 \| Person C \| repo hygiene \| Verify repo is public and all required files are committed \| API 10, UI 10, TRN 10 \| 0.25h \| public repo contains code, notebook, docs, and no secret leakage \| ✅ Completed \| Person B (Ayush) \|
	\| DOC 09 \| E12.2 \| all \| submission form prep \| Prepare final submission links and partner track selections \| DOC 07, DOC 08 \| 0.5h \| all submission fields have final links and verified accessibility \| ✅ Completed \| Person B (Ayush) - docs/submission_prep.md with links, tracks, and checklist \|
	\| DOC 10 \| E12.2 \| all \| dry run \| Run final three minute pitch plus two minute Q and A rehearsal \| DOC 09 \| 0.75h \| team can explain tracks, reward, architecture, and results confidently \| ✅ Completed \| Person B (Ayush) - docs/pitch_outline.md with 3-min structure + Q&A prep \|
	\| DOC 11 \| E12.1 \| Person D \| `README.md` \| Add evaluation summary table for average reward, rounds to agreement, invalid action rate, agreement rate, and note the `/web` fallback route as backup demo path \| DOC 03, DOC 04, TRN 15, API 19 \| 0.5h \| README results and setup sections reflect all promised metrics and clearly document the fallback demo route \| ✅ Completed \| Person B (Ayush) - evaluation table + /web fallback documented in README \|

	---

	## 9. Critical path

	These tasks form the core chain that must not slip:

	1. FND 08, FND 09
	2. MOD 01 to MOD 05, MOD 11, MOD 12
	3. SCN 01 to SCN 09, SCN 13
	4. AGT 05 to AGT 07, AGT 11
	5. JDG 01 to JDG 05
	6. ENV 01 to ENV 06
	7. API 01 to API 10, API 13, API 14, API 16
	8. TRN 01 to TRN 08, TRN 13, TRN 14
	9. DOC 05 to DOC 09

	If any of these are blocked, the team should swarm and unblock immediately.

	---

	## 10. Suggested work allocation by time block

	## Block 1. Foundation and contracts
	Duration target: first 2 to 3 hours

	\| Person \| Highest priority tasks \|
	\| --- \| --- \|
	\| Person A \| FND 04, FND 08, FND 09, MOD 01 to MOD 05, MOD 11, MOD 12 \|
	\| Person B \| MOD 09, AGT 01, AGT 02, AGT 11 \|
	\| Person C \| FND 01 to FND 03, FND 05, FND 07, FND 10, FND 11, FND 12 \|
	\| Person D \| FND 06, FND 13, initial UI shell planning, doc stub \|

	## Block 2. One end to end scenario
	Duration target: next 3 to 4 hours

	\| Person \| Highest priority tasks \|
	\| --- \| --- \|
	\| Person A \| SCN 01 to SCN 04, SCN 13, JDG 01 to JDG 04, ENV 01 to ENV 03 \|
	\| Person B \| AGT 03 to AGT 07, AGT 10 \|
	\| Person C \| API 01 to API 03, API 13, API 14 \|
	\| Person D \| UI 01 to UI 05 \|

	## Block 3. Full environment plus deploy
	Duration target: next 3 to 4 hours

	\| Person \| Highest priority tasks \|
	\| --- \| --- \|
	\| Person A \| SCN 05 to SCN 10, JDG 11, ENV 04 to ENV 11 \|
	\| Person B \| AGT 08, AGT 09, TRN 01 to TRN 04, TRN 13, TRN 14 \|
	\| Person C \| API 04 to API 10, API 15 to API 19 \|
	\| Person D \| UI 06 to UI 10, UI 13 \|

	## Block 4. Training, docs, and polish
	Duration target: next 3 to 5 hours

	\| Person \| Highest priority tasks \|
	\| --- \| --- \|
	\| Person A \| TST 01 to TST 05, edge case fixes \|
	\| Person B \| TRN 05 to TRN 15, TST 09 \|
	\| Person C \| TST 06, TST 07, TST 11, OBS tasks, deployment fixes \|
	\| Person D \| UI 11, UI 12, UI 14, UI 15, DOC 01 to DOC 07, DOC 11 \|

	## Block 5. Final freeze
	Duration target: final 2 hours

	\| Person \| Highest priority tasks \|
	\| --- \| --- \|
	\| All \| TST 10 to TST 12, DOC 08 to DOC 11, final bug fixes only \|

	---

	## 11. Acceptance criteria for the whole MVP

	The MVP is complete when all of the following are true:

	1. `ReplicaLabEnv` supports `reset()`, `step()`, `state()`, and `close()`
	2. At least one scenario family runs end to end, with a target of three
	3. The Scientist and Lab Manager can complete a multi round negotiation
	4. The Judge returns rigor, feasibility, fidelity, total reward, and deterministic audit notes
	5. Reward logs are persisted for completed episodes
	6. The server exposes health, reset, step, scenarios, and replay endpoints
	7. WebSocket sessions work without cross talk
	8. The environment is live on a public HF Space on port `7860`
	9. The Colab notebook can connect to the environment and complete training
	10. The notebook produces at least one reward curve
	11. The frontend can demonstrate one episode clearly, and the documented `/web` fallback works if the custom UI fails
	12. README explains setup, architecture, and results
	13. The repo is public
	14. The demo video is uploaded
	15. The team can explain which tracks and sponsor fits are being targeted
	16. Final terminal responses and replay logs include Judge audit notes and verdict
	17. Evaluation outputs report average reward, rounds to agreement, invalid action rate, and agreement rate

	---

	## 12. Nice to have backlog, only after MVP is green

	\| Priority order \| Task \| Why it matters \|
	\| --- \| --- \| --- \|
	\| 1 \| add side by side before versus after comparison in UI \| strongest demo improvement visual \|
	\| 2 \| add judge plain English explanation panel \| better judge readability \|
	\| 3 \| add second and third difficulty levels to all templates \| stronger world modeling story \|
	\| 4 \| add curriculum training path \| stronger self improvement story \|
	\| 5 \| add Lab Manager orchestrator with specialist subagents for compute, scheduling, budget, or risk review \| stronger multi agent depth while preserving the same outer contract \|
	\| 6 \| add third agent such as ethics reviewer \| potential partner fit extension \|
	\| 7 \| add post episode self critique before retry \| stronger self improvement story from Blueprint Section 14.2 \|
	\| 8 \| add automatic scenario difficulty scaling \| adaptive curriculum from Blueprint Section 14.2 \|

	---

	## 13. Risk register and mitigation

	\| Risk \| Likely impact \| Mitigation owner \| Mitigation plan \|
	\| --- \| --- \| --- \| --- \|
	\| schema churn breaks integration \| high \| Person A \| freeze contracts early and review all changes in PR \|
	\| RL training is unstable \| high \| Person B \| keep the reward deterministic, train Scientist first, and keep the model-backed Lab Manager grounded by the deterministic checker with low-variance settings or frozen weights during Scientist training \|
	\| HF Space deployment issues \| high \| Person C \| test local Docker first and keep `/health` simple \|
	\| frontend polish consumes too much time \| medium \| Person D \| keep fallback to OpenEnv `/web` or a very thin React view \|
	\| reward too noisy or subjective \| high \| Person A \| keep judge deterministic and rubric based \|
	\| final demo breaks live \| high \| all \| keep replay logs and a pre tested demo seed ready \|
	\| too many scenarios \| medium \| Person A \| ship one excellent scenario, then add more only if stable \|
	\| scenario adapters become mini business-logic engines \| medium \| Person A \| keep adapters thin, emit normalized packs only, and push scoring or validation rules back into shared checker modules \|
	\| hybrid Lab Manager drifts from checker truth \| medium \| Person B \| treat checker output as source of truth, derive final action fields from validated checker results, and use model-backed text only for negotiation language and alternatives \|

	---

	## 14. Handoff contracts between workstreams

	### Environment to frontend contract
	The backend must expose:

	1. initial observation
	2. current round
	3. conversation log
	4. current proposed protocol
	5. score breakdown
	6. episode id
	7. replay payload
	8. CORS headers allowing frontend origin in dev and production

	### Environment to training contract
	The environment client must expose:

	1. `reset(seed, template, difficulty)`
	2. `step(action)`
	3. reward
	4. done
	5. final info including component scores
	6. API key or secret configuration for hosted-model access in both hosted and notebook environments

	### Scenario to judge contract
	Every scenario must provide:

	1. normalized scenario pack
	2. success criteria
	3. allowed substitutions
	4. constraints and resources
	5. hidden reference spec
	6. scenario id and seed

	---

	## 15. Team meeting rhythm

	\| Meeting \| Duration \| Purpose \|
	\| --- \| --- \| --- \|
	\| kickoff sync \| 15 min \| confirm scope, owners, blockers \|
	\| integration sync \| 10 min every 2 to 3 hours \| merge timing and interface checks \|
	\| pre demo sync \| 15 min \| decide the exact demo path and backup path \|
	\| freeze sync \| 10 min \| only high severity fixes after this point \|

	---

	## 16. Final recommendation on staffing focus

	If the team gets overloaded, protect this order:

	1. environment core
	2. reward engine
	3. server and deployment
	4. training notebook
	5. minimal UI
	6. README
	7. demo video
	8. extra scenarios
	9. extra polish

	The project wins on clarity and working proof, not on the largest number of features.

	---

	## 17. One sentence team mission

	Build a deterministic OpenEnv world where a Scientist learns, through RL, to negotiate high quality technical plans with a constraint-aware Lab Manager across seeded domains, starting with mathematics and machine learning.