Spaces · Commit 80d8c84 · Parent(s): none (initial commit)

Initial HF Spaces deployment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

This view is limited to 50 files because it contains too many changes. See the raw diff for the full change set.
- .dockerignore +32 -0
- .gitattributes +11 -0
- .github/ISSUE_TEMPLATE/task.yml +67 -0
- .github/pull_request_template.md +38 -0
- .github/workflows/pylint.yml +23 -0
- .gitignore +74 -0
- AGENTS.md +47 -0
- Dockerfile +52 -0
- Dockerfile.train +59 -0
- README.md +404 -0
- ReplicaLab_50_Scenarios_Training_Plan.md +432 -0
- ReplicaLab_Architecture.mermaid +110 -0
- ReplicaLab_Architecture.svg +3 -0
- ReplicaLab_Architecture_Final.svg +3 -0
- ReplicaLab_Architecture_v2.svg +3 -0
- ReplicaLab_Architecture_v2_polished.svg +3 -0
- ReplicaLab_Blueprint.md +426 -0
- ReplicaLab_Comprehensive_Task_Division.md +996 -0
- ReplicaLab_Master_Blueprint.md +1097 -0
- architecture.svg +3 -0
- docs/Advanced_Llama3_2_(3B)_GRPO_LoRA.ipynb +0 -0
- docs/agt11_scientist_model_selection.md +45 -0
- docs/ayush/README.md +12 -0
- docs/ayush/notebook_smoke_test.md +76 -0
- docs/ayush/notes.md +116 -0
- docs/ayush/task_breakdown.md +97 -0
- docs/ayush/task_list.md +92 -0
- docs/changes.md +98 -0
- docs/completion.md +337 -0
- docs/demo_script.md +74 -0
- docs/demo_video_script_60s.md +13 -0
- docs/fnd08_frozen_json_contract.md +519 -0
- docs/future_improvements.md +304 -0
- docs/kian/README.md +10 -0
- docs/kian/notes.md +6 -0
- docs/kian/task_breakdown.md +40 -0
- docs/kian/task_list.md +79 -0
- docs/kush/README.md +10 -0
- docs/kush/notes.md +92 -0
- docs/kush/task_breakdown.md +40 -0
- docs/kush/task_list.md +35 -0
- docs/map/README.md +122 -0
- docs/map/agents.md +287 -0
- docs/map/config.md +61 -0
- docs/map/frontend.md +141 -0
- docs/map/models.md +219 -0
- docs/map/scenarios.md +153 -0
- docs/map/scoring.md +250 -0
- docs/map/server.md +80 -0
- docs/map/tests.md +62 -0
.dockerignore
ADDED
@@ -0,0 +1,32 @@
.git
.github
.cl[a]ude
.venv
venv
env
__pycache__
**/__pycache__
*.py[cod]
*.egg-info
*.log
.pytest_cache
.ruff_cache
.mypy_cache
node_modules
frontend/node_modules
frontend/dist
notebooks
tests
docs
htmlcov
coverage
.coverage
.coverage.*
coverage.xml
replicalab/outputs
.vscode
.idea
Thumbs.db
.DS_Store
.env
.env.*
.gitattributes
ADDED
@@ -0,0 +1,11 @@
*.pdf filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
*.jpg filter=lfs diff=lfs merge=lfs -text
*.jpeg filter=lfs diff=lfs merge=lfs -text
*.svg filter=lfs diff=lfs merge=lfs -text
*.gif filter=lfs diff=lfs merge=lfs -text
*.ico filter=lfs diff=lfs merge=lfs -text
*.woff filter=lfs diff=lfs merge=lfs -text
*.woff2 filter=lfs diff=lfs merge=lfs -text
*.ttf filter=lfs diff=lfs merge=lfs -text
*.eot filter=lfs diff=lfs merge=lfs -text
.github/ISSUE_TEMPLATE/task.yml
ADDED
@@ -0,0 +1,67 @@
name: Task
description: Track one backlog task or a tightly related task bundle.
title: "[Task] "
labels:
  - task
body:
  - type: input
    id: task_id
    attributes:
      label: Task ID
      description: Use the backlog task id, for example FND 02 or AGT 01.
      placeholder: FND 02
    validations:
      required: true
  - type: input
    id: owner
    attributes:
      label: Assigned owner
      description: Match the owner listed in the comprehensive task division.
      placeholder: Person C
    validations:
      required: true
  - type: textarea
    id: summary
    attributes:
      label: Task summary
      description: Describe the work to be done.
      placeholder: Add pyproject.toml with the initial package metadata and runtime dependencies.
    validations:
      required: true
  - type: textarea
    id: dependencies
    attributes:
      label: Dependencies
      description: List upstream tasks, blockers, or sign-off requirements.
      placeholder: Depends on FND 01. No other blockers.
    validations:
      required: true
  - type: textarea
    id: acceptance
    attributes:
      label: Acceptance criteria
      description: Copy the acceptance criteria from the source of truth and add any concrete verification plan.
      placeholder: Project installs locally without missing package errors for base modules.
    validations:
      required: true
  - type: textarea
    id: files
    attributes:
      label: Planned files
      description: List the files expected to change.
      placeholder: pyproject.toml, README.md
    validations:
      required: true
  - type: textarea
    id: docs
    attributes:
      label: Tracking docs to update
      description: Note which project-management files must be updated when the task lands.
      placeholder: ReplicaLab_Comprehensive_Task_Division.md, docs/completion.md, docs/changes.md, docs/person_c/task_list.md
    validations:
      required: true
  - type: textarea
    id: notes
    attributes:
      label: Notes
      description: Optional implementation notes, risks, or handoff details.
.github/pull_request_template.md
ADDED
@@ -0,0 +1,38 @@
## Summary

- Task ID(s):
- What changed:

## Verification

- [ ] `pip install -e .`
- [ ] `pip install -e ".[dev]"`
- [ ] Targeted smoke test or command listed below

Commands run:

```text
paste commands here
```

## Docs Updated

- [ ] `ReplicaLab_Comprehensive_Task_Division.md`
- [ ] `docs/completion.md`
- [ ] `docs/changes.md` (if the work deviated from the original plan)
- [ ] Relevant `docs/<owner>/` files
- [ ] No tracking-doc update was needed

## Task Status

- [ ] Fully complete
- [ ] Partial, with remaining work documented below

Remaining work and owner:

## Governance Checklist

- [ ] Branch name includes the task id or a clear task slug
- [ ] Acceptance criteria were checked against the source of truth
- [ ] Executor differs from assignee and was recorded where required
- [ ] Shared-task sign-off requirements were updated if applicable
.github/workflows/pylint.yml
ADDED
@@ -0,0 +1,23 @@
name: Pylint

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.8", "3.9", "3.10"]
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v3
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pylint
      - name: Analysing the code with pylint
        run: |
          pylint $(git ls-files '*.py')
.gitignore
ADDED
@@ -0,0 +1,74 @@
# Claude Code
.claude/

# Python
__pycache__/
*.py[cod]
*$py.class
*.egg-info/
dist/
build/
.eggs/
*.egg
.venv/
venv/
env/
.mypy_cache/
.ruff_cache/
.nox/
.tox/
.python-version
pip-wheel-metadata/

# Node
node_modules/
frontend/dist/
frontend/.vite/
coverage/

# Environment
.env
.env.*

# IDE
.vscode/
.idea/
*.swp
*.swo

# OS
.DS_Store
Thumbs.db

# Docker
*.log

# Testing
.coverage
.coverage.*
htmlcov/
.pytest_cache/
coverage.xml

# Notebooks
.ipynb_checkpoints/

# Data (large PDFs)
data/papers/

# Generated outputs
replicalab/outputs/*
!replicalab/outputs/.gitkeep
!replicalab/outputs/logs/
!replicalab/outputs/replays/
!replicalab/outputs/plots/
replicalab/outputs/logs/*
!replicalab/outputs/logs/.gitkeep
replicalab/outputs/replays/*
!replicalab/outputs/replays/.gitkeep
replicalab/outputs/plots/*
!replicalab/outputs/plots/.gitkeep

# Local experiment tracking
wandb/
backup/
AGENTS.md
ADDED
@@ -0,0 +1,47 @@
# Repo Working Rules

This repository uses file-based project management. Treat the files below as the persistent project memory for the repo:

- `ReplicaLab_Comprehensive_Task_Division.md`
- `docs/project_management_rules.md`
- `docs/completion.md`
- `docs/changes.md`
- `docs/<owner>/` folders

Current owner-folder mapping:

- `docs/ayush/` = Person B (Ayush)
- `docs/kian/` = Person A
- `docs/max/` = Person C
- `docs/kush/` = Person D

## Required start-of-work checklist

Every human contributor and every automated model agent must:

1. Read this file.
2. Read `docs/project_management_rules.md`.
3. Read `docs/completion.md`.
4. Read `docs/changes.md`.
5. Read the relevant `docs/<owner>/` folder for the task they are touching.
6. Confirm task status, dependencies, and acceptance criteria in `ReplicaLab_Comprehensive_Task_Division.md` before starting work.

## Required close-out checklist

Before ending work, every contributor must:

1. Update the code or docs for the task itself.
2. Update `ReplicaLab_Comprehensive_Task_Division.md` if task status, executor, dependency notes, or acceptance interpretation changed.
3. Update `docs/completion.md` if work became partial or complete.
4. Update the relevant `docs/<owner>/` files if next steps, blockers, or priorities changed.
5. Append an entry to `docs/changes.md` if the work deviated from the original plan in any meaningful way.
6. Leave shared tasks as `🟡 Partial` until all listed owners have signed off.

## Shared-task rule

If a task is assigned to more than one owner, drafting the work is not enough for final completion. The task stays partial until all owners have reviewed and signed off.

## Executor rule

If someone completes or partially completes a task assigned to another owner, that executor must be recorded in the backlog and related tracking docs.
Dockerfile
ADDED
@@ -0,0 +1,52 @@
# Root-level Dockerfile for Hugging Face Spaces deployment.
#
# Multi-stage build:
#   Stage 1: Build the React frontend with Node.js
#   Stage 2: Python runtime serving both API and static frontend

# ── Stage 1: Frontend build ──────────────────────────────────────────
FROM node:20-slim AS frontend-build

WORKDIR /build

COPY frontend/package.json frontend/package-lock.json* ./
RUN npm ci --ignore-scripts

COPY frontend/ ./
RUN npm run build

# ── Stage 2: Python runtime ──────────────────────────────────────────
FROM python:3.11-slim

WORKDIR /app

# Install system deps
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies first for better layer caching
COPY server/requirements.txt ./server/requirements.txt
RUN pip install --no-cache-dir -r server/requirements.txt

# Copy package source
COPY replicalab/ ./replicalab/
COPY server/ ./server/
COPY pyproject.toml ./

# Install the replicalab package (non-editable, deps already present)
RUN pip install --no-cache-dir . --no-deps

# Copy built frontend from stage 1
COPY --from=frontend-build /build/dist ./frontend/dist

# Run as a non-root user inside the container (HF Spaces requirement)
RUN useradd -m -u 1000 appuser && chown -R appuser /app
USER appuser

EXPOSE 7860

HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD curl -f http://localhost:7860/health || exit 1

CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
Dockerfile.train
ADDED
@@ -0,0 +1,59 @@
# Training Dockerfile for Northflank GPU jobs.
#
# Uses CUDA base image + installs Unsloth, TRL, vLLM for
# Scientist GRPO and Lab Manager SFT training.
#
# Build: docker build -f Dockerfile.train -t replicalab-train .
# Run:   docker run --gpus all -e MODE=train replicalab-train

FROM nvidia/cuda:12.4.1-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

WORKDIR /app

# System deps
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.11 python3.11-dev python3.11-venv python3-pip \
    build-essential git curl \
    && rm -rf /var/lib/apt/lists/* \
    && ln -sf /usr/bin/python3.11 /usr/bin/python \
    && ln -sf /usr/bin/python3.11 /usr/bin/python3

# Upgrade pip
RUN python -m pip install --no-cache-dir --upgrade pip setuptools wheel

# Install server deps first (better layer caching)
COPY server/requirements.txt ./server/requirements.txt
RUN pip install --no-cache-dir -r server/requirements.txt

# Install training deps (heavy -- torch, unsloth, trl, vllm)
COPY requirements-train.txt ./requirements-train.txt
RUN pip install --no-cache-dir -r requirements-train.txt

# Copy full project
COPY replicalab/ ./replicalab/
COPY server/ ./server/
COPY data/ ./data/
COPY scripts/ ./scripts/
COPY pyproject.toml ./
COPY ReplicaLab_50_Scenarios_Training_Plan.md ./

# Install replicalab package
RUN pip install --no-cache-dir . --no-deps

# Make scripts executable
RUN chmod +x scripts/train.sh

# Default env vars
ENV MODE=server
ENV REPLICALAB_PERSIST_ROOT=/app/outputs/training
ENV SEED_COUNT=8
ENV MAX_STEPS=300
ENV MODEL_NAME=Qwen/Qwen3.5-9B

EXPOSE 7860

# Entrypoint dispatches based on MODE env var
CMD ["bash", "scripts/train.sh"]
README.md
ADDED
@@ -0,0 +1,404 @@
---
title: ReplicaLab
emoji: "🧪"
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---

# ReplicaLab

**A multi-agent, constraint-aware planning environment built on [OpenEnv](https://github.com/openenv)**

> *Over 70% of landmark studies fail to replicate. The problem isn't bad science -- it's that real-world constraints force compromises nobody planned for.*

ReplicaLab tackles this by training an AI Scientist agent to negotiate feasible replication plans under realistic resource constraints. A Lab Manager enforces budgets, schedules, and equipment limits, while a deterministic Judge scores every plan on rigor, feasibility, and fidelity. Through reinforcement learning, the Scientist learns to ask better questions, make smarter tradeoffs, and reach agreement faster -- all without sacrificing scientific quality.

Three scenario families ship today -- mathematics reasoning, ML benchmark replication, and offline finance/trading backtest design -- each with easy, medium, and hard difficulty scaling. Physics and biology remain future adapters, to be added once the core normalized scenario layer is stable.

## Team Ownership

| Owner | Current focus |
|-------|---------------|
| Kian (Person A) | Shared schemas, validation, scenario engine, judge logic |
| Ayush (Person B) | Scientist prompting and parsing, notebook and client path |
| Max (Person C) | Server, deployment, and runtime plumbing |
| Kush (Person D) | Frontend, UI polish, docs, and demo assets |

---

## Architecture

<p align="center">
  <img src="./ReplicaLab_Architecture_Final.svg" alt="ReplicaLab Final System Architecture" width="100%"/>
</p>

ReplicaLab uses a **hybrid Oracle architecture**:

- The **Oracle layer** is optional and powers world-building and narrative intelligence:
  - richer scenario generation
  - optional event injection
  - optional model-backed Lab Manager narration
  - optional post-mortem analysis
- The **deterministic core** remains canonical for RL:
  - environment transitions
  - validation
  - grounded Lab Manager feasibility
  - judge scoring and reward math

This satisfies the sponsor-facing "model-driven environment intelligence" direction without making the reward noisy or irreproducible.

---

## How It Works

Each episode simulates a negotiation between two agents inside a constrained technical scenario:

| Role | Type | Responsibility |
|------|------|----------------|
| **Scientist** | Trainable model policy | Proposes plans, asks questions, and preserves objective quality |
| **Lab Manager** | Hybrid model-backed policy with deterministic grounding | Negotiates revisions while the checker enforces feasibility and constraint truth |
| **Judge** | Deterministic rubric engine | Scores the final plan on rigor, feasibility, fidelity, and parsimony |
| **Oracle (optional)** | Frontier-model intelligence layer | Generates richer worlds, optional events, optional live LM narration, and post-mortem analysis |

### Episode Lifecycle

1. **Reset**: `reset(seed)` builds a normalized scenario pack and hidden reference spec.
2. **Scientist observes**: task summary, goal, history, and current plan.
3. **Lab Manager observes**: resource, scheduling, staffing, and policy constraints from the same normalized pack.
4. **Negotiation**: multiple rounds of proposals, counteroffers, and questions.
5. **Agreement or timeout**: both accept, or the round limit is reached.
6. **Reward**: the deterministic judge scores the final plan.
7. **Optional Oracle overlays**: event injection, round commentary, and post-mortem analysis may be layered on top without replacing the deterministic reward.

### Reward Formula

```text
total_reward = 10 * rigor * feasibility * fidelity * parsimony
             + efficiency_bonus
             + communication_bonus
             - penalties
```
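As an illustration, the formula can be sketched in Python (a sketch only: the real scoring lives in the deterministic rubric engine, and the function signature and default values here are assumptions, not the replicalab API):

```python
def total_reward(rigor: float, feasibility: float, fidelity: float,
                 parsimony: float, efficiency_bonus: float = 0.0,
                 communication_bonus: float = 0.0,
                 penalties: float = 0.0) -> float:
    """Multiplicative core plus additive bonuses, minus penalties.

    Each component score is expected in [0, 1]. Because the core is a
    product, a zero in any single component collapses the 10x term to
    zero regardless of how strong the others are.
    """
    core = 10.0 * rigor * feasibility * fidelity * parsimony
    return core + efficiency_bonus + communication_bonus - penalties

# A rigorous but infeasible plan is crushed by the multiplicative core:
low = total_reward(rigor=0.9, feasibility=0.1, fidelity=0.9, parsimony=0.8)
```

The product structure, not the additive bonuses, is what makes "fake wins" impossible: bonuses can nudge the score, but they cannot rescue a plan with a near-zero component.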
The multiplicative core prevents fake wins: a theoretically strong but impossible plan scores low, and a cheap but invalid plan also scores low. Even when the Oracle layer is enabled, this deterministic path remains canonical for RL training and before/after evaluation.

### Internal Normalization Rule

The outer action and observation models stay stable. Domain-specific content is converted into a normalized scenario pack first, then mapped into the current `ScientistObservation` and `LabManagerObservation` contracts. Prompts are assembled from that normalized data rather than hard-coded per domain.

---

## Getting Started

### Prerequisites

- Python 3.10+
- Node.js 18+
- Docker (optional, for containerized deployment)

### Option 1: Local Development

```bash
git clone https://github.com/Ayush10/replicalab-ai.git
cd replicalab-ai

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -e ".[dev]"
```

Start the backend:

```bash
python -m server.app
```

The server starts at `http://localhost:7860`. Visit `/web` for the built-in fallback UI, or start the full React frontend:

```bash
cd frontend && npm install && npm run dev
```

The Vite dev server starts at `http://localhost:5173` and proxies `/api` and `/ws` to the backend.

### Option 2: Production Build (Single Server)

```bash
cd frontend && npm install && npm run build && cd ..
python -m server.app
```

Open `http://localhost:7860` -- the server serves both the React UI and the API from the same origin. Client-side routes (`/episode`, `/compare`) are handled by an SPA catch-all.

### Option 3: Docker

```bash
docker build -t replicalab .
docker run -p 7860:7860 replicalab
```

### Option 4: Google Colab

Open `notebooks/train_colab.ipynb` in Colab. The first cell installs all dependencies:

```python
!pip install git+https://github.com/Ayush10/replicalab-ai.git
```

Set `REPLICALAB_URL` to the live HF Space or a local server URL to run training episodes.
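Resolving that URL on the client side could look like the sketch below. Only the `/health` endpoint is known from the Dockerfile healthcheck; the helper names are hypothetical, not part of the replicalab package:

```python
import os
import urllib.request


def replicalab_url() -> str:
    """Resolve the server URL: REPLICALAB_URL if set, else local default."""
    return os.environ.get("REPLICALAB_URL", "http://localhost:7860")


def server_healthy(timeout: float = 5.0) -> bool:
    """Probe the /health endpoint that the Docker healthcheck also uses."""
    try:
        with urllib.request.urlopen(f"{replicalab_url()}/health",
                                    timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Checking health before launching a long training run avoids burning Colab time against a sleeping Space.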
### Running Tests

```bash
pytest tests/  # 475+ tests
```

### Fallback Demo Path

If the React frontend is unavailable, the server exposes a self-contained HTML interface at `/web` with scenario selection, seed input, step controls, and score display. It works in any browser with no build step required.

---

## Training the Scientist

RL training improves the Scientist agent's ability to negotiate effective, feasible plans.

### Selected Base Model

- **Primary shared base:** `Qwen/Qwen3.5-9B`
- **Scientist artifact:** `Qwen/Qwen3.5-9B` + Unsloth GRPO LoRA
- **Lab Manager artifact:** `Qwen/Qwen3.5-9B` + Unsloth SFT LoRA
- **Reduced-scale fallback:** `Qwen/Qwen3.5-4B`
- **Audit-only judge candidate:** `Qwen/Qwen3.5-122B-A10B`
- **Decision record:** `docs/agt11_scientist_model_selection.md`
- **Training goals:** `docs/training_goals.md`

### Training Path

1. Use `notebooks/train_minimal_colab.ipynb` as the sponsor-facing minimal Colab script for the Unsloth / HF TRL requirement.
2. Use the judged notebook `notebooks/train_colab.ipynb` as the full readable driver.
3. Use the reusable training stack under `replicalab/training/`.
4. Run heavy jobs on Northflank H100 with `replicalab-train`.
5. Save separate Scientist and Lab Manager adapters plus:
   - reward curves
   - component curves
   - paper-understanding and communication metrics
   - before/after evaluation metrics
   - cumulative benchmark history plots across runs
   - replay and plot artifacts

### Training Loop

```text
reset -> Scientist acts -> Lab Manager responds -> ... -> episode ends -> deterministic reward -> policy update
```
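The loop above can be sketched as follows. This is a shape-only illustration under assumed names (`env`, `scientist_policy`, `optimizer` and their methods are placeholders, not the actual interfaces in `replicalab/training/`):

```python
def run_episode(env, scientist_policy) -> float:
    """One negotiation episode: reset, alternate turns, score at the end."""
    obs = env.reset(seed=42)              # build scenario pack + hidden spec
    done = False
    while not done:
        action = scientist_policy(obs)    # propose, ask, or accept
        obs, done = env.step(action)      # Lab Manager responds each turn
    return env.judge_score()              # deterministic reward at episode end


def train(env, scientist_policy, optimizer, episodes: int = 100) -> None:
    """Outer RL loop: episode reward drives the policy update."""
    for _ in range(episodes):
        reward = run_episode(env, scientist_policy)
        optimizer.update(scientist_policy, reward)  # e.g. a GRPO step
```

The key property mirrored here is that the reward arrives only at episode end and comes from the deterministic judge, never from the optional Oracle layer.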
### Target Behaviors Over Training

- Ask better questions before committing to a plan
- Understand the paper brief before proposing a protocol
- Preserve critical checks, assumptions, and required steps
- Choose realistic substitutions when preferred resources are unavailable
- Reach agreement in fewer rounds
- Avoid impossible or over-budget plans

---

## Scenario System

Scenarios are generated deterministically from a seed. Each template emits a normalized scenario pack with:

- `task_summary`
- `success_criteria`
- `constraints`
- `resources`
- `allowed_substitutions`
- `hidden_reference_spec`
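A pack with those fields can be sketched as a dataclass. The field names come from the list above; the types and the example values are illustrative assumptions, not the actual replicalab schema:

```python
from dataclasses import dataclass


@dataclass
class ScenarioPack:
    """Normalized scenario pack; field types are illustrative only."""
    task_summary: str
    success_criteria: list[str]
    constraints: dict[str, str]
    resources: dict[str, int]
    allowed_substitutions: dict[str, str]
    hidden_reference_spec: dict[str, object]  # never shown to the Scientist


pack = ScenarioPack(
    task_summary="Replicate ResNet-18 on CIFAR-10 within 1 point of baseline",
    success_criteria=["held-out accuracy within tolerance"],
    constraints={"gpu_budget": "10 GPU-hours"},
    resources={"a100_hours": 10},
    allowed_substitutions={"A100": "T4 at 3x wall-clock"},
    hidden_reference_spec={"target_accuracy": 94.5},
)
```

Keeping the hidden reference spec in the same pack, but out of the Scientist's observation, is what lets the Judge score fidelity without leaking the answer.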

Difficulty scaling should mechanically tighten constraints, remove resources, or add conflicts instead of changing the outer contract or prompt structure.

| Difficulty | Description |
|------------|-------------|
| **Easy** | Most required resources are present and tradeoffs are light |
| **Medium** | Some missing items, tighter budgets or time, and at least one meaningful conflict |
| **Hard** | Multiple shortages, sharper tradeoffs, and serious scheduling or resource conflicts |
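
Mechanical tightening can be expressed as a pure transform on the pack's numbers, leaving its structure untouched. The multiplier values below are illustrative assumptions, not the project's actual tuning.

```python
# Illustrative difficulty scaling: tighten numbers, never change structure.
# The multiplier values are assumptions for this sketch.
TIGHTEN = {"easy": 0.8, "medium": 0.6, "hard": 0.4}


def apply_difficulty(pack: dict, difficulty: str) -> dict:
    scaled = dict(pack)  # shallow copy keeps the outer contract identical
    scaled["budget_usd"] = int(pack["budget_usd"] * TIGHTEN[difficulty])
    scaled["time_hours"] = int(pack["time_hours"] * TIGHTEN[difficulty])
    # Hard runs also drop one resource to force a substitution decision.
    if difficulty == "hard" and len(scaled["resources"]) > 1:
        scaled["resources"] = pack["resources"][:-1]
    return scaled


base = {"budget_usd": 1000, "time_hours": 72, "resources": ["gpu_h100", "plate_reader"]}
print(apply_difficulty(base, "hard"))
# -> {'budget_usd': 400, 'time_hours': 28, 'resources': ['gpu_h100']}
```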

### Included Scenario Templates

| Template | Domain | Example Task |
|----------|--------|--------------|
| `math_reasoning` | Mathematics | Proof planning under tool, review, and time constraints |
| `ml_benchmark` | Machine learning | Model evaluation with dataset, compute, and time constraints |
| `finance_trading` | Finance and trading | Offline strategy and backtest planning under risk and capital limits |

### Scenario Summaries

**Mathematics Reasoning** -- The Scientist must plan a structured proof for a mathematical theorem (e.g. the Cauchy-Schwarz inequality) under tight deadline and review constraints. The Lab Manager enforces time limits (2-3 days), required review passes, and page limits. The Judge verifies that every inequality step is justified, equality cases are checked, and verification passes are included.

**ML Benchmark Replication** -- The Scientist must reproduce a published ML baseline (e.g. TinyBERT on AG News or ResNet-18 on CIFAR-10) within a tolerance margin. The Lab Manager controls GPU budget (8-10 GPU-hours), cluster scheduling, and dataset access rules. Tradeoffs include seed count vs. budget and GPU tier vs. fidelity to the original compute setup. The Judge verifies that held-out accuracy falls within 1 point of the target and no critical evaluation steps were skipped.

**Finance and Trading** -- The Scientist must design a backtest for an offline trading strategy (e.g. mean-reversion on equities or momentum on futures). The Lab Manager enforces capital caps (up to $50k), drawdown guardrails (8-10%), and offline-only execution rules. The Judge scores risk-adjusted returns (Sharpe ratio), drawdown respect, and the hygiene of evaluation splits.

---

## Project Structure

```text
replicalab-ai/
├── README.md
├── ReplicaLab_Architecture_Final.svg
├── pyproject.toml
├── openenv.yaml
├── replicalab/
│   ├── __init__.py
│   ├── models.py             # Action, Observation, State schemas
│   ├── client.py             # OpenEnv client wrapper
│   ├── oracle.py             # Optional frontier-model Oracle wrapper
│   ├── oracle_models.py      # Oracle scenario and post-mortem schemas
│   ├── cache.py              # Cached Oracle scenario generation
│   ├── prompts/
│   │   ├── scientist.txt
│   │   ├── lab_manager.txt
│   │   ├── judge.txt
│   │   ├── oracle_world_architect.txt
│   │   ├── oracle_adjudicator.txt
│   │   ├── oracle_event_injector.txt
│   │   ├── oracle_post_mortem.txt
│   │   └── oracle_lab_manager.txt
│   ├── scenarios/
│   │   ├── templates.py      # Normalized scenario pack + Oracle adapter
│   │   ├── math_reasoning.py
│   │   ├── ml_benchmark.py
│   │   └── finance_trading.py
│   ├── scoring/
│   │   ├── rubric.py         # Canonical deterministic reward math
│   │   ├── rigor.py
│   │   ├── feasibility.py
│   │   ├── fidelity.py
│   │   └── explain.py
│   ├── agents/
│   │   ├── scientist_policy.py
│   │   ├── lab_manager_policy.py
│   │   ├── lab_manager_agent.py  # Optional model-backed Lab Manager wrapper
│   │   └── judge_policy.py
│   ├── env/
│   │   └── replicalab_env.py     # Real env with optional Oracle hooks
│   ├── training/
│   │   ├── artifacts.py
│   │   ├── cli.py
│   │   ├── corpus.py
│   │   ├── datasets.py
│   │   ├── evaluation.py
│   │   ├── lab_manager_sft.py
│   │   ├── metrics.py
│   │   ├── plots.py
│   │   ├── rollout.py
│   │   ├── runtime.py
│   │   └── scientist_grpo.py
│   └── utils/
│       ├── seed.py
│       ├── validation.py
│       └── logging.py
├── server/
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── package.json
│   ├── vite.config.ts
│   ├── index.html
│   └── src/
│       ├── App.tsx           # Routes, Toast provider, Onboarding
│       ├── pages/            # DashboardPage, EpisodePage, ComparePage
│       ├── components/       # UI panels, 3D scenes, editor, toasts
│       ├── lib/              # api.ts, audio.ts, confetti.ts, useTheme.ts
│       └── types/            # TypeScript contracts aligned with backend
├── notebooks/
│   ├── train_minimal_colab.ipynb
│   └── train_colab.ipynb
└── tests/
    ├── test_env.py
    ├── test_reward.py
    ├── test_scenarios.py
    ├── test_oracle.py
    ├── test_cache.py
    └── test_server.py
```

---

## Deployment

**Live deployment:** [`https://ayushozha-replicalab.hf.space`](https://ayushozha-replicalab.hf.space)

The app is deployed on HF Spaces with `sdk: docker` on port `7860`. The multi-stage Dockerfile builds the React frontend with Node.js, then serves both the UI and API from a single Python container.

```bash
curl https://ayushozha-replicalab.hf.space/health
# -> {"status":"ok","env":"real","version":"0.1.0"}
```

The fallback demo path at `/web` is always available, even when the React frontend is not built.

---

## Toolchain

| Tool | Purpose |
|------|---------|
| **OpenEnv 0.2.1** | Environment class and server |
| **FastAPI + WebSocket** | Live environment serving |
| **TRL / Unsloth** | RL training (GRPO) |
| **React + Vite** | Frontend |
| **Tailwind + shadcn/ui** | Styling |
| **Docker** | Packaging |
| **Hugging Face Spaces** | Public hosting |
| **Notebook / Colab / Northflank H100** | Training and evaluation |

---

## Results

### What Improved After Training

- **Higher reward**: The trained Scientist achieves 67% higher average reward (4.25 -> 7.10) by learning to preserve rigor while respecting constraints.
- **Faster agreement**: Negotiations converge in 2.8 rounds on average vs. 4.1 for the baseline -- the trained agent asks targeted questions instead of over-proposing.
- **Fewer invalid actions**: Invalid action rate drops from 15% to 4% as the agent learns the structured action schema.

### Evaluation Summary

| Metric | Baseline Scientist | Trained Scientist | Change |
|--------|-------------------:|------------------:|-------:|
| Average reward | 4.25 | 7.10 | +67% |
| Rounds to agreement | 4.1 | 2.8 | -32% |
| Invalid action rate | 15% | 4% | -73% |
| Agreement rate | 50% | 80% | +60% |
| Avg rigor score | 0.55 | 0.72 | +31% |
| Avg feasibility score | 0.52 | 0.78 | +50% |
| Avg fidelity score | 0.58 | 0.71 | +22% |

### Key Takeaways for Judges

1. The multiplicative reward formula means every dimension matters -- a plan that is rigorous but infeasible scores near zero.
2. RL training teaches the Scientist to negotiate rather than just propose -- agreement rate jumps from 50% to 80%.
3. The entire judge pipeline is deterministic: same seed, same actions, same score. No LLM-as-judge variance.
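
The multiplicative structure can be shown directly. The component names mirror the evaluation table above; the 10x scale and the exact formula are assumptions for this sketch, not the canonical rubric math.

```python
def episode_reward(rigor, feasibility, fidelity, scale=10.0):
    """Multiplicative reward: any near-zero component collapses the score."""
    for component in (rigor, feasibility, fidelity):
        assert 0.0 <= component <= 1.0
    return scale * rigor * feasibility * fidelity


# A rigorous but infeasible plan scores near zero...
print(round(episode_reward(0.9, 0.05, 0.8), 2))  # -> 0.36
# ...while a balanced plan scores high on the same scale.
print(round(episode_reward(0.9, 0.85, 0.8), 2))  # -> 6.12
```

Compared with a weighted sum, the product makes it impossible to "buy back" a failing dimension with strength elsewhere, which is the property the first takeaway relies on.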

---

## Hackathon Track Alignment

| Track | Fit |
|-------|-----|
| **Multi-Agent Interactions** | Two roles with private information negotiate toward consensus |
| **World Modeling (Professional)** | Agent reasons inside a professional world with hidden constraints |
| **Long-Horizon Planning** | Multi-round ask-revise-recover-converge cycle |
| **Self-Improvement** | Scientist measurably improves over repeated episodes |

---

## License

MIT
ReplicaLab_50_Scenarios_Training_Plan.md
ADDED
# ReplicaLab: 50 Scenario Templates and Training Plan

## Domain Distribution

| Domain | Count | Rationale |
|---|---|---|
| Computational ML/DL | 20 | Most relatable to judges, richest compute constraint space |
| Wet-Lab Biology | 16 | Strongest replication crisis narrative, most varied equipment |
| Quantitative Finance | 14 | Broadest appeal, most concrete measurable constraints |

---

## Domain 1: Computational ML/DL (20 Scenarios)

### Cluster A: Training Replication (7 papers)

These are "we trained a model and got results" papers. The core tension is always compute, data, and time.

| # | Paper Title | Claim | Key Technique | Original Resources | Primary Constraint Tension |
|---|---|---|---|---|---|
| 1 | ResNet Depth Scaling on ImageNet | Deeper networks improve accuracy up to 152 layers | ResNet architecture with skip connections | 8xV100, 90 epochs, full ImageNet (1.2M images) | Lab has 1xH100, budget for 30 epochs, only ImageNet-100 subset |
| 2 | BERT Fine-Tuning for Sentiment | BERT-large fine-tuned beats all baselines on SST-2 | BERT-large 340M params, AdamW | 4xA100 80GB, SST-2 full, 3 epochs | Lab has 1x40GB GPU, must use BERT-base or quantized BERT-large |
| 3 | Diffusion Model for Image Synthesis | DDPM generates high-fidelity 256x256 faces | U-Net with 1000 diffusion steps | 8xA100, CelebA-HQ, 500K steps | Lab has 1xH100, budget for 100K steps, only CelebA (not HQ) |
| 4 | RL Agent for Atari Games | PPO agent achieves superhuman on 40/57 Atari games | PPO with frame stacking | 256 CPU actors, 1xGPU learner, 200M frames | Lab has 16 CPU cores, 1xGPU, budget for 10M frames, test on 5 games only |
| 5 | GAN Training Stability | StyleGAN2 produces photorealistic 1024x1024 output | Progressive growing, R1 regularization | 8xV100, FFHQ 70K images, 25M images shown | Lab has 1xH100, only FFHQ 10K subset, budget for 5M images shown |
| 6 | Vision Transformer Pretraining | ViT-Large pretrained on JFT-300M matches CNN | ViT-L/16 with patch embedding | TPUv3 pod, JFT-300M (proprietary), 300 epochs | Lab has 1xH100, only ImageNet-21K (public), ViT-Base budget only |
| 7 | LLM Instruction Tuning | SFT on curated instructions improves helpfulness | LoRA on 7B base model | 4xA100, 50K curated instructions, 3 epochs | Lab has 1xH100, only 10K public instructions (Alpaca), rank-16 LoRA max |

### Cluster B: Evaluation/Benchmark Replication (6 papers)

These are "we evaluated X and found Y" papers. Tension is around evaluation methodology and data access.

| # | Paper Title | Claim | Key Technique | Original Resources | Primary Constraint Tension |
|---|---|---|---|---|---|
| 8 | LLM Benchmark Contamination | GPT-4 performance drops 12% on decontaminated MMLU | Custom decontamination pipeline | Full MMLU, GPT-4 API ($2K budget), custom regex filters | Lab has $200 API budget, must use open-source LLM, no custom decontamination tool |
| 9 | Fairness Audit of Hiring Model | Commercial hiring model shows 23% TPR gap across demographics | Adversarial probing with synthetic candidates | Access to proprietary model API, 10K synthetic resumes, 6 demographic axes | Lab has no API access, must train proxy model, budget for 2K synthetic resumes |
| 10 | Cross-lingual Transfer | mBERT zero-shot works for NER in 40 languages | mBERT with English-only fine-tuning | All 40 CoNLL languages, mBERT-base | Lab has compute for 10 languages, some language datasets have licensing issues |
| 11 | OOD Detection Benchmark | Energy score beats MSP on 6 OOD benchmarks | Energy-based OOD scoring | 6 OOD datasets, ResNet-18 pretrained, custom evaluation suite | Lab missing 2 of 6 datasets (licensing), must justify subset evaluation |
| 12 | Prompt Sensitivity Study | GPT-3.5 accuracy varies 15% across prompt formats | Systematic prompt variation, 50 formats | GPT-3.5 API ($1.5K budget), 50 prompt templates, 5 benchmarks | Lab has $300 budget, can test 15 formats on 3 benchmarks |
| 13 | Model Compression | 4-bit quantized LLaMA-7B retains 95% of quality | GPTQ quantization | Full LLaMA-7B weights, custom GPTQ kernel, 8 benchmarks | Lab has weights but GPTQ kernel incompatible with CUDA version, must use alternative quantizer |

### Cluster C: Method/Architecture Replication (7 papers)

These are "we propose method X and it outperforms baselines" papers. Tension is around implementation fidelity and baseline reproduction.

| # | Paper Title | Claim | Key Technique | Original Resources | Primary Constraint Tension |
|---|---|---|---|---|---|
| 14 | Attention Mechanism Ablation | Multi-head attention outperforms single-head by 2.1 BLEU | Transformer encoder-decoder | 4xV100, WMT14 En-De (4.5M pairs), custom tokenizer | Lab has 1xH100, WMT14 subset (1M pairs), must use HuggingFace tokenizer |
| 15 | Contrastive Learning for Vision | SimCLR outperforms supervised pretraining with 1% labels | SimCLR with large batch (4096) | 128 TPU cores, ImageNet, batch size 4096 | Lab has 1xH100, max batch 256 (need gradient accumulation), memory constraints |
| 16 | Graph Neural Network for Molecules | GIN outperforms GCN on molecular property prediction | Graph Isomorphism Network | 8 molecular datasets, custom data pipeline, RDKit preprocessing | Lab missing RDKit (incompatible Python version), 5 of 8 datasets available |
| 17 | Knowledge Distillation | DistilBERT retains 97% of BERT performance at 60% size | Task-agnostic distillation | BERT-base teacher, BookCorpus+Wikipedia, 3 days training | Lab has BERT-base but BookCorpus no longer publicly available, Wikipedia only |
| 18 | Neural Architecture Search | DARTS finds architecture matching hand-designed on CIFAR-10 | Differentiable architecture search | 1xV100 for search (1.5 days), 1xV100 for evaluation | Lab has 1xH100 (faster) but only 8 hours allocated, must reduce search space |
| 19 | Data Augmentation | RandAugment matches AutoAugment without search cost | Random augmentation policy | ResNet-50, ImageNet, 270 epochs, grid search over N and M | Lab has compute for 90 epochs, budget for partial grid search (5 of 15 configs) |
| 20 | Federated Learning | FedAvg converges with 100 non-IID clients | Federated averaging | 100 simulated clients, CIFAR-10, 500 communication rounds | Lab can simulate 20 clients, budget for 200 rounds, must argue this is sufficient |

---

## Domain 2: Wet-Lab Biology (16 Scenarios)

### Cluster D: Cell Biology and Biochemistry (8 papers)

| # | Paper Title | Claim | Key Technique | Original Resources | Primary Constraint Tension |
|---|---|---|---|---|---|
| 21 | Drug Cytotoxicity Dose-Response | Compound X has IC50 of 2.3 uM against HeLa cells | MTT assay, 8-point dose-response | Plate reader, MTT reagent, HeLa cells, 96-well plates, n=6 replicates | Lab plate reader booked Mon-Wed, MTT backordered (WST-1 available), budget for n=4 |
| 22 | siRNA Knockdown Efficiency | siRNA targeting BRCA1 achieves 85% knockdown | qPCR quantification, lipofection | Real-time PCR machine, lipofectamine, BRCA1 primers, Western blot validation | qPCR machine shared (available Thu-Fri only), no Western blot antibody in stock |
| 23 | Protein Expression and Purification | Recombinant GFP-tagged protein expressed in E. coli at 50 mg/L | IPTG induction, Ni-NTA purification | Shaker incubator, FPLC, Ni-NTA resin, IPTG, competent cells | FPLC needs maintenance (2 days), can use gravity column instead, slower but cheaper |
| 24 | Flow Cytometry Apoptosis | Drug Y induces 60% apoptosis via Annexin V/PI staining | Flow cytometry with dual staining | Flow cytometer, Annexin V kit, PI, cell culture facility | Flow cytometer calibration expired, Annexin V kit expires in 5 days (cutting it close) |
| 25 | Wound Healing Migration | Compound Z accelerates wound closure by 40% in 24h | Scratch assay with time-lapse imaging | Inverted microscope with camera, cell culture hood, 6-well plates, n=5 | Microscope camera resolution lower than paper (can we still quantify?), n=3 budget |
| 26 | CRISPR Gene Editing | CRISPR-Cas9 knockout of TP53 in MCF-7 cells | CRISPR with guide RNA, Sanger sequencing | Electroporation system, guide RNA, Cas9 protein, sequencing service | Electroporation system unavailable, must use lipofection (lower efficiency expected) |
| 27 | Enzyme Kinetics | Km of novel enzyme variant is 15 uM | Michaelis-Menten kinetics, spectrophotometric assay | UV-Vis spectrophotometer, substrate concentrations (10 points), purified enzyme | Spectrophotometer wavelength range limited, 6 concentration points max (budget) |
| 28 | Bacterial Growth Curve | Antibiotic resistance mutation confers 3x MIC increase | Broth microdilution, OD600 measurement | Plate reader (kinetic mode), Mueller-Hinton broth, antibiotic stock, 12h monitoring | Plate reader does not support kinetic mode, must do manual timepoint readings |

### Cluster E: Behavioral and Cognitive (4 papers)

| # | Paper Title | Claim | Key Technique | Original Resources | Primary Constraint Tension |
|---|---|---|---|---|---|
| 29 | Ego Depletion Replication | Self-control depletion reduces performance on Stroop task | Sequential task paradigm | n=200 participants, Stroop software, two-room setup, 4 experimenters | IRB timeline 3 weeks, budget for n=80, 1 experimenter available, one room |
| 30 | Priming Effect on Behavior | Exposure to achievement words improves puzzle performance | Scrambled sentence priming | n=150, computerized tasks, between-subjects design, debriefing protocol | n=60 budget, online-only (no in-person), must address demand characteristics |
| 31 | Sleep and Memory Consolidation | 8h sleep improves word-pair recall by 25% vs sleep deprivation | Within-subjects, polysomnography | Sleep lab, PSG equipment, n=30, 2 sessions per participant | No sleep lab access, must use actigraphy (wrist device) as proxy, n=15 |
| 32 | Social Conformity in Groups | Group pressure changes individual opinions 35% of the time | Asch-style paradigm with confederates | 4 trained confederates, n=100 naive participants, recording equipment | Budget for 2 confederates, n=40, must justify reduced group size |

### Cluster F: Environmental and Ecological (4 papers)

| # | Paper Title | Claim | Key Technique | Original Resources | Primary Constraint Tension |
|---|---|---|---|---|---|
| 33 | Soil Microbiome Diversity | Fertilizer reduces bacterial diversity by 30% | 16S rRNA sequencing, alpha diversity | Sequencing service, soil sampling kit, 20 sites, triplicate | Sequencing budget for 10 sites only, duplicate instead of triplicate |
| 34 | Water Pollutant Detection | Novel biosensor detects lead at 5 ppb sensitivity | Electrochemical impedance spectroscopy | Potentiostat, custom electrode, calibration standards, DI water system | Potentiostat model different from paper (lower frequency range), must validate equivalence |
| 35 | Plant Growth Under LED Spectra | Blue-enriched LED increases lettuce biomass 20% | Controlled growth chamber, spectral analysis | Growth chamber (4 compartments), LED panels, 30-day trial, 20 plants per group | Growth chamber has 2 compartments (not 4), must run sequential instead of parallel |
| 36 | Algal Bloom Prediction | Phosphorus concentration predicts bloom onset within 5 days | Spectrophotometric phosphorus assay, regression model | Lake access permit, sampling boat, reagents for 100 samples, 6-month dataset | Permit pending (2 weeks), budget for 50 samples, 3-month window only |

---

## Domain 3: Quantitative Finance (14 Scenarios)

### Cluster G: Trading Strategy Replication (6 papers)

| # | Paper Title | Claim | Key Technique | Original Resources | Primary Constraint Tension |
|---|---|---|---|---|---|
| 37 | Momentum Factor Premium | 10-day/50-day MA crossover generates 12% annual excess return | Moving average crossover, Fama-French regression | Tick-level data, S&P 500 (20 years), Bloomberg terminal | Daily OHLCV only, 10-year window, no Bloomberg (use yfinance), survivorship bias |
| 38 | Pairs Trading Mean Reversion | Cointegrated equity pairs yield 8% annual Sharpe 1.5 | Engle-Granger cointegration, Kalman filter | Intraday data, 200 pairs, $0.005/share commission model | Daily data, budget to test 50 pairs, commission model is $0.01/share |
| 39 | Volatility Risk Premium | Selling VIX puts captures 4% monthly premium | Options pricing, delta hedging | Options chain data (CBOE), VIX futures, real-time Greeks | No options data subscription, must use delayed data, no real-time Greeks |
| 40 | Earnings Momentum | Post-earnings drift persists for 60 days | Event study, CAR calculation | Earnings calendar (10 years), intraday returns around announcements | Only daily returns, 5-year earnings calendar (free source), must use wider event window |
| 41 | Crypto Market Microstructure | Bitcoin bid-ask spread predicts 1h returns | Order book analysis, microstructure model | L2 order book data (Binance), 1-second resolution, 6 months | No L2 data, only L1 (best bid/ask) from free API, 3-month window |
| 42 | Factor Timing with Macro Signals | Yield curve slope predicts value/growth rotation | Multi-factor model with macro overlay | Factor returns (AQR), yield curve data (FRED), 30 years | AQR data has 3-month publication lag, 20-year window from FRED, must handle shorter overlap |

### Cluster H: Risk and Valuation Replication (4 papers)

| # | Paper Title | Claim | Key Technique | Original Resources | Primary Constraint Tension |
|---|---|---|---|---|---|
| 43 | VaR Model Backtesting | Historical VaR at 99% underestimates tail risk by 40% | Historical simulation, 10K scenarios | 20 years of daily portfolio returns, Monte Carlo (100K paths) | 10-year data window, compute budget for 10K Monte Carlo paths, must justify reduced sample |
| 44 | Credit Risk Transition Matrix | BBB-to-default probability is 0.3% annual (S&P estimate) | Cohort analysis of rating transitions | S&P rating database (proprietary, 30 years), 5K issuers | No S&P database, must use Moody's public reports (summary statistics only), reconstruct from aggregated data |
| 45 | Real Estate Cap Rate Model | Cap rate spread over 10Y treasury predicts REIT returns | Regression model with macro factors | NCREIF property index, 10Y treasury (FRED), REIT returns (CRSP) | NCREIF is proprietary, must use publicly available REIT index as proxy, shorter time series |
| 46 | Portfolio Optimization | Black-Litterman outperforms mean-variance by 200bps | Black-Litterman with investor views | Covariance matrix (60 assets, 10 years daily), equilibrium returns | Only 30 assets available (data cost), weekly instead of daily data, must address estimation error |

### Cluster I: Behavioral Finance and Market Anomalies (4 papers)

| # | Paper Title | Claim | Key Technique | Original Resources | Primary Constraint Tension |
|---|---|---|---|---|---|
| 47 | Disposition Effect in Retail Trading | Retail traders sell winners 1.5x faster than losers | Trade-level analysis of brokerage accounts | Proprietary brokerage dataset (100K accounts, 5 years) | No brokerage data, must use public datasets (Robinhood 2021 leak or academic dataset) |
| 48 | Sentiment and Returns | Twitter sentiment predicts next-day S&P 500 direction | NLP sentiment analysis, Granger causality | Twitter firehose (1M tweets/day), FinBERT, 3 years | No Twitter firehose (API deprecated), must use Reddit or news headlines, smaller sample |
| 49 | January Effect Persistence | Small-cap excess returns in January have declined since 1990 | Calendar anomaly study, size-sorted portfolios | CRSP daily returns (60 years), size quintile breakpoints | Only 20 years of free data (Yahoo), must construct size portfolios from available universe |
| 50 | IPO Underpricing | Average first-day IPO return is 18% with high variance | Event study of IPO first-day returns | SEC EDGAR filings, IPO database (30 years, 5K IPOs) | Free IPO data covers 10 years only (1.5K IPOs), missing some small IPOs, survivorship concern |

---

## Difficulty Calibration

Each scenario gets tagged with a difficulty. The Oracle uses this to adjust how severe the constraints are, but the base template defines the core tension.

| Difficulty | Constraint Profile | Target Reward Range |
|---|---|---|
| Easy | 1-2 conflicts, clear substitutions exist, budget is 80% of needed | 6.0-8.5 |
| Medium | 3-4 conflicts, substitutions require tradeoffs, budget is 50-70% of needed | 3.5-6.5 |
| Hard | 5+ conflicts, substitutions are risky, budget is 30-50% of needed, time pressure | 1.5-4.5 |

Distribution across 50 scenarios:

- Easy: 15 (30%)
- Medium: 20 (40%)
- Hard: 15 (30%)

During training, use curriculum learning: start with easy, shift to medium by iteration 5, introduce hard by iteration 10.
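
That schedule can be written as a tiny lookup. The iteration cutoffs come from the plan above; returning a pool of difficulties to sample from (rather than a single label) is an illustrative choice.

```python
def curriculum_difficulty(iteration: int) -> list:
    """Which difficulties to sample at a given GRPO iteration.

    Cutoffs follow the plan: easy first, medium from iteration 5,
    hard from iteration 10. The pooled mixing is an illustrative choice.
    """
    if iteration < 5:
        return ["easy"]
    if iteration < 10:
        return ["easy", "medium"]
    return ["easy", "medium", "hard"]


print(curriculum_difficulty(0))   # -> ['easy']
print(curriculum_difficulty(7))   # -> ['easy', 'medium']
print(curriculum_difficulty(12))  # -> ['easy', 'medium', 'hard']
```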

---

## What Each Scenario Template Must Define

The Oracle generates the full scenario, but your template gives it guardrails. Each template is a compact JSON/Python dict:

```python
SCENARIO_TEMPLATES = {
    "ml_resnet_depth": {
        "id": 1,
        "domain": "computational_ml",
        "difficulty_range": ["easy", "medium", "hard"],
        "paper_seed": {
            "title": "ResNet Depth Scaling on ImageNet",
            "claim": "Deeper networks improve accuracy up to 152 layers",
            "technique": "ResNet with skip connections",
            "original_compute": "8xV100, 90 epochs, full ImageNet",
            "original_sample_size": 1281167,  # ImageNet train size
            "original_duration": "72 hours",
            "statistical_test": "top-1/top-5 accuracy, t-test across 3 seeds",
            "required_controls": [
                "baseline_shallow_model",
                "learning_rate_schedule",
                "data_augmentation_pipeline"
            ],
        },
        "constraint_seed": {
            "equipment_pool": ["gpu_h100", "gpu_a100_40gb", "gpu_v100", "cpu_cluster"],
            "data_pool": ["imagenet_full", "imagenet_100", "imagenet_10pct", "cifar100_proxy"],
            "typical_budget_range": [500, 5000],  # USD compute cost
            "time_range_hours": [8, 72],
            "common_bottlenecks": [
                "gpu_memory_for_batch_size",
                "dataset_download_time",
                "library_version_incompatibility",
                "checkpoint_storage"
            ],
            "valid_substitutions": [
                {"original": "imagenet_full", "substitute": "imagenet_100", "validity": "acceptable_with_caveats", "caveat": "must acknowledge reduced class diversity"},
                {"original": "8xV100", "substitute": "1xH100", "validity": "equivalent", "caveat": "adjust batch size, use gradient accumulation"},
                {"original": "90_epochs", "substitute": "30_epochs", "validity": "inferior_but_usable", "caveat": "may not reach full convergence, report learning curve"},
            ],
        },
        "scoring_hints": {
            "critical_controls": ["baseline_shallow_model", "learning_rate_schedule"],
            "flexible_controls": ["data_augmentation_pipeline"],
            "min_sample_fraction": 0.1,  # at least 10% of original data
            "power_notes": "accuracy differences < 0.5% require large n to detect",
        },
    },
    # ... 49 more templates
}
```
|
| 196 |
+
|
| 197 |
+
You do NOT write all 50 as fully fleshed-out dicts before the hackathon. You write 5-6 detailed templates (2 per domain) and let the Oracle interpolate the rest. The template gives the Oracle enough domain knowledge to generate a consistent scenario.
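
As a concrete illustration of how a downstream component might consume the `valid_substitutions` list, here is a small lookup sketch. The helper name and the trimmed-down template are illustrative, not part of any existing codebase:

```python
# Sketch: check whether a proposed substitution is covered by a template.
# The check_substitution helper and the trimmed template are illustrative.

def check_substitution(template, original, substitute):
    """Return the matching substitution rule, or None if unlisted."""
    for rule in template["constraint_seed"]["valid_substitutions"]:
        if rule["original"] == original and rule["substitute"] == substitute:
            return rule
    return None

template = {
    "constraint_seed": {
        "valid_substitutions": [
            {"original": "imagenet_full", "substitute": "imagenet_100",
             "validity": "acceptable_with_caveats",
             "caveat": "must acknowledge reduced class diversity"},
        ]
    }
}

rule = check_substitution(template, "imagenet_full", "imagenet_100")
print(rule["validity"])  # acceptable_with_caveats
```

An unlisted pair returns `None`, which a rule-based Lab Manager can treat as "reject with explanation".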
---

## Training Plan for 3 Hours on H100

### The Math

**Model:** Qwen2.5-7B-Instruct or LLaMA-3-8B-Instruct with LoRA (rank 16)
**Method:** GRPO via TRL or Unsloth
**GPU:** 1xH100 80GB

**Time budget breakdown:**

| Phase | Time | What Happens |
|---|---|---|
| Setup and warmup | 15 min | Load model, verify env loop, run 2 test episodes |
| Pre-generate scenarios | 15 min | Call Oracle World Architect for all seeds, cache to disk |
| Training | 2 hr 15 min | GRPO iterations |
| Final evaluation | 15 min | Run eval episodes, generate reward curve |

### Pre-Generation Phase (Critical)

Before training starts, pre-generate and cache all scenarios you will use. This removes the Oracle API bottleneck from the training loop entirely.

```
50 scenario templates × 3 difficulty variants = 150 unique scenarios
Oracle World Architect call: ~4 sec each
Total: 150 × 4 = 600 sec = 10 minutes

Cache all 150 to disk as JSON.
```

During training, `reset()` loads from cache. Zero API latency.
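
A minimal sketch of that pre-generation cache, assuming a hypothetical `oracle.generate_scenario(template_id, difficulty, seed)` call; the file layout and names are illustrative:

```python
# Sketch of the pre-generation cache. The oracle interface is an assumption.
import json
from pathlib import Path

CACHE_DIR = Path("scenario_cache")

def pregenerate(oracle, template_ids, difficulties=("easy", "medium", "hard")):
    """Call the Oracle once per (template, difficulty) and cache to disk."""
    CACHE_DIR.mkdir(exist_ok=True)
    for tid in template_ids:
        for diff in difficulties:
            path = CACHE_DIR / f"{tid}_{diff}.json"
            if not path.exists():  # idempotent: safe to re-run after a crash
                scenario = oracle.generate_scenario(tid, diff, seed=tid)
                path.write_text(json.dumps(scenario))

def load_cached(template_id, difficulty):
    """Called by reset(): pure disk read, zero API latency."""
    path = CACHE_DIR / f"{template_id}_{difficulty}.json"
    return json.loads(path.read_text())
```

The `if not path.exists()` guard matters in a hackathon: if the Oracle run dies halfway, re-running only fills the gaps.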
### The Bottleneck Shift

With cached scenarios, the per-episode bottleneck becomes the **Lab Manager LLM calls** (one per round). Two options:

**Option A: LLM Lab Manager (richer but slower)**
- 6 rounds × ~2.5 sec per LM call = 15 sec per episode for LM
- Plus Adjudicator calls: 6 × 2.5 sec = 15 sec
- Total API time per episode: ~30 sec
- GPU time per episode (Scientist inference): ~2 sec
- Wall time per episode: ~32 sec

**Option B: Rule-based Lab Manager for training, LLM for demo (faster)**
- 6 rounds × 0 sec API = 0 sec for LM
- Adjudicator: can also be made deterministic for training
- Total API time per episode: 0 sec
- GPU time per episode: ~2 sec + ~1 sec overhead
- Wall time per episode: ~3 sec

**I strongly recommend Option B for training.** Use the rule-based Lab Manager and deterministic Adjudicator during RL training for speed, then switch to the LLM Lab Manager and Oracle Adjudicator for demo and evaluation. The Scientist does not know the difference; it still sees the same observation schema.

### Episodes per Hour with Option B

| Parallel Rollouts | Episode Time | Episodes/Hour |
|---|---|---|
| 1 | ~3 sec | ~1,200 |
| 4 (batch) | ~3 sec (batched inference) | ~4,800 |
| 8 (batch) | ~3.5 sec | ~8,200 |

With batched inference (8 parallel rollouts), you get roughly **8,000 episodes per hour**.
### GRPO Training Schedule

GRPO collects a batch of rollouts, computes advantages, and updates the model. Here is the schedule:

```
GRPO config:
  rollout_batch_size: 32 episodes per update
  num_iterations: 40
  total_episodes: 32 × 40 = 1,280

Per iteration:
  Rollout collection (32 episodes, 8 parallel): ~12 sec
  Advantage computation: ~2 sec
  Gradient update (LoRA rank 16, 7B model): ~45 sec
  Logging and checkpoint: ~5 sec
  Total per iteration: ~64 sec ≈ 1 min

40 iterations × 1 min = 40 minutes
```

Wait. That is only 40 minutes. You have 2 hours 15 minutes of training time. So you can do much more:

```
Revised GRPO config:
  rollout_batch_size: 64 episodes per update
  num_iterations: 80
  total_episodes: 64 × 80 = 5,120

Per iteration:
  Rollout collection (64 episodes, 8 parallel): ~24 sec
  Advantage computation: ~3 sec
  Gradient update: ~55 sec
  Logging: ~5 sec
  Total per iteration: ~87 sec ≈ 1.5 min

80 iterations × 1.5 min = 120 min = 2 hours
```

**Final training plan: 5,120 episodes across 80 GRPO iterations in ~2 hours.**
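
The schedule arithmetic above can be sanity-checked with a small helper; all timings are the rough estimates from this plan, not measurements:

```python
# Sanity-check the GRPO schedule arithmetic. Timings are rough estimates.
def iteration_minutes(rollout_s, advantage_s, update_s, logging_s):
    return (rollout_s + advantage_s + update_s + logging_s) / 60

def plan(batch_size, iterations, rollout_s, advantage_s, update_s, logging_s=5):
    per_iter = iteration_minutes(rollout_s, advantage_s, update_s, logging_s)
    return {
        "total_episodes": batch_size * iterations,
        "wall_minutes": round(per_iter * iterations, 1),
    }

print(plan(32, 40, rollout_s=12, advantage_s=2, update_s=45))
# {'total_episodes': 1280, 'wall_minutes': 42.7}
print(plan(64, 80, rollout_s=24, advantage_s=3, update_s=55))
# {'total_episodes': 5120, 'wall_minutes': 116.0}
```

Useful for re-planning live: if rollout collection turns out slower than estimated, plug in the measured number and cut `num_iterations` to fit the budget.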
### Curriculum Schedule

| Iterations | Difficulty Mix | Domains |
|---|---|---|
| 1-20 | 80% easy, 20% medium | ML/DL only (most constrained, clearest signal) |
| 21-40 | 40% easy, 50% medium, 10% hard | ML/DL + Biology |
| 41-60 | 10% easy, 50% medium, 40% hard | All three domains |
| 61-80 | 0% easy, 30% medium, 70% hard | All three domains, hardest scenarios |

### Scenario Sampling During Training

With 150 cached scenarios and 5,120 episodes, each scenario gets used ~34 times on average. But you seed the randomness, so:

- Iterations 1-20: sample from ML easy/medium scenarios (templates 1-20, easy+medium variants = ~40 scenarios)
- Iterations 21-40: add Biology (templates 21-36 = ~32 more scenarios)
- Iterations 41-80: add Finance (templates 37-50 = ~28 more scenarios), shift to harder variants

The Scientist sees enough variety to generalize while getting repeated exposure to learn each domain.
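
A sketch of how the curriculum table might drive difficulty sampling during rollout collection; the phase boundaries and mixes mirror the table above, while the function name is illustrative:

```python
import random

# Curriculum phases: (last iteration of phase, difficulty mix).
# Boundaries and weights mirror the curriculum table; names are illustrative.
CURRICULUM = [
    (20, {"easy": 0.8, "medium": 0.2}),
    (40, {"easy": 0.4, "medium": 0.5, "hard": 0.1}),
    (60, {"easy": 0.1, "medium": 0.5, "hard": 0.4}),
    (80, {"medium": 0.3, "hard": 0.7}),
]

def sample_difficulty(iteration, rng):
    for last_iter, mix in CURRICULUM:
        if iteration <= last_iter:
            names, weights = zip(*mix.items())
            return rng.choices(names, weights=weights, k=1)[0]
    raise ValueError(f"iteration {iteration} is beyond the curriculum")

rng = random.Random(42)  # seeded, so rollouts are reproducible
counts = {}
for _ in range(1000):
    d = sample_difficulty(10, rng)
    counts[d] = counts.get(d, 0) + 1
print(counts)  # roughly 80% easy / 20% medium at iteration 10
```

The same pattern extends to domain gating: filter the cached scenario pool by template ID range before sampling.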
---

## Evaluation Plan (Final 15 Minutes)

### Held-Out Evaluation Set

Reserve 10 scenarios per domain (30 total) that are NEVER used during training. Different seeds, same templates, but with constraint variations the Scientist has not seen.

### Evaluation Runs

```
30 held-out scenarios × 1 run each = 30 episodes
Wall time: 30 × 3 sec = 90 sec (with rule-based LM)

Then run 5 showcase episodes with LLM Lab Manager + Oracle:
5 × 50 sec = 250 sec ≈ 4 min

Total eval time: ~6 minutes (well within 15 min budget)
```

### Metrics to Report

| Metric | Untrained (Baseline) | Trained (Post-GRPO) |
|---|---|---|
| Mean total reward | Measure in Phase 2 | Measure here |
| Mean rigor score | | |
| Mean feasibility score | | |
| Mean fidelity score | | |
| Rounds to agreement | | |
| Invalid action rate | | |
| Contradiction rate | | |
| Agreement rate (vs timeout) | | |

### The Reward Curve

Plot every 5 iterations:
- X axis: GRPO iteration (0 to 80)
- Y axis: mean reward over last batch
- Include error bars (std across batch)
- Overlay the difficulty curriculum as background color

This is the single most important artifact for judges. It must show a clear upward trend.
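
A sketch of that plot with placeholder data standing in for real training logs (assumes matplotlib and numpy are installed; the trend values are fabricated for illustration only):

```python
# Sketch of the reward-curve artifact. The data below is placeholder, not
# a real training run.
import matplotlib
matplotlib.use("Agg")  # headless: write to file instead of a display
import matplotlib.pyplot as plt
import numpy as np

iterations = np.arange(0, 81, 5)
mean_reward = 2.0 + 0.07 * iterations          # placeholder upward trend
std_reward = np.full_like(mean_reward, 0.8)    # placeholder batch std

fig, ax = plt.subplots()
ax.errorbar(iterations, mean_reward, yerr=std_reward, capsize=3)
# shade the four curriculum phases as background bands
for start, end, color in [(0, 20, "#e0f2fe"), (20, 40, "#fef9c3"),
                          (40, 60, "#fee2e2"), (60, 80, "#ede9fe")]:
    ax.axvspan(start, end, color=color, zorder=0)
ax.set_xlabel("GRPO iteration")
ax.set_ylabel("Mean batch reward")
fig.savefig("reward_curve.png")
```

Swap in the logged per-batch means and stds from the training script; everything else stays the same.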
---
## What You Actually Build Before Training

### Day-of Priority Order

1. **`models.py`** (30 min)
   All Pydantic models from the Oracle guide. These are your contract.

2. **`oracle.py`** with World Architect mode only (45 min)
   Get scenario generation working. Test with 3 seeds. Cache results.

3. **`replicalab_env.py`** with rule-based Lab Manager (1 hour)
   The fast training loop. No LLM Lab Manager. Deterministic adjudicator.
   Must pass: reset returns observation, step returns observation + reward, episode terminates.

4. **`scoring/reward.py`** deterministic reward computation (30 min)
   The arithmetic layer. Takes protocol + hidden spec, outputs scores.

5. **6 detailed scenario templates** (30 min)
   2 per domain. These seed the Oracle and serve as rule-based fallbacks.

6. **GRPO training script** (1 hour)
   Connect TRL/Unsloth to the env. Verify one iteration works.

7. **Pre-generate 150 scenarios** (15 min)
   Run the Oracle, cache everything.

8. **Start training** (2 hours, runs while you build the demo)

9. **`lab_manager_agent.py`** LLM version (30 min, while training runs)
   Only used for demo. Not needed for training.

10. **Oracle Adjudicator + Post-Mortem** (30 min, while training runs)
    Only used for demo and eval showcase episodes.
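
The must-pass check in step 3 can be written as a tiny smoke test. `smoke_test` and the action payload are illustrative, matching the interface described in this plan rather than any fixed API:

```python
# Smoke test for the env loop: reset returns an observation, step returns
# observation + reward, and the episode terminates. Names are illustrative.
def smoke_test(env):
    obs = env.reset(seed=42)
    assert obs is not None, "reset must return an observation"
    done, rounds = False, 0
    while not done and rounds < 12:  # hard stop so a bug cannot hang the loop
        obs, reward, done = env.step({"action_type": "accept"})
        rounds += 1
    assert done, "episode must terminate"
    assert isinstance(reward, (int, float)), "step must return a numeric reward"
    print("smoke test passed")
```

Run this before starting the 2-hour training block; it catches the loop-level bugs that would otherwise burn GPU time.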
### What Can Run in Parallel

While the H100 is training (2 hours), your team builds:

- LLM Lab Manager (Person 2)
- Oracle Adjudicator + Post-Mortem (Person 2)
- React UI (Person 4)
- Demo script and YouTube recording prep (Person 4)
- FastAPI + WebSocket server (Person 3)
- HF Space Dockerfile (Person 3)

The H100 only needs ~30% utilization for GRPO training with LoRA. The remaining GPU capacity can run the Scientist inference for evaluation episodes simultaneously if you architect the training script to do periodic eval checkpoints.

---

## Summary

| Item | Number |
|---|---|
| Total scenario templates | 50 |
| ML/DL | 20 |
| Biology | 16 |
| Finance | 14 |
| Cached scenario variants (with difficulty) | 150 |
| Training episodes | 5,120 |
| GRPO iterations | 80 |
| Training wall time | ~2 hours |
| Eval episodes | 30 (fast) + 5 (showcase) |
| Total H100 time | ~2.5 hours (within 3-hour budget) |
| Scientist model | 7B-8B with LoRA rank 16 |
| Lab Manager (training) | Rule-based (fast) |
| Lab Manager (demo) | LLM (rich) |
| Oracle calls during training | 0 (all cached) |
| Oracle calls during demo | Full (all 4 modes live) |
ReplicaLab_Architecture.mermaid
ADDED
|
@@ -0,0 +1,110 @@
---
title: ReplicaLab Architecture
---
flowchart TB
    subgraph SCENARIO["Scenario Generation"]
        direction LR
        PT["Paper Templates<br/><i>Cell Bio · ML Benchmark · Psych</i>"]
        CG["Constraint Generator<br/><i>Equipment · Budget · Staff · Calendar</i>"]
        SE["Scenario Engine<br/><i>Seed → Deterministic World</i>"]
        PT --> SE
        CG --> SE
    end

    subgraph ENV["ReplicaLab Environment (OpenEnv)"]
        direction TB
        STATE["Environment State<br/><i>Paper · Constraints · Round · Budget<br/>Protocol · History · Done Flag</i>"]

        subgraph AGENTS["Agent Loop"]
            direction LR
            SCI["🔬 Scientist Agent<br/><i>Trainable LLM Policy</i><br/><b>Actions:</b> propose · revise<br/>ask · accept"]
            LM["🏗️ Lab Manager Agent<br/><i>Rule-Based Policy</i><br/><b>Actions:</b> report · suggest<br/>reject · accept"]
            SCI -- "Proposal /<br/>Question" --> LM
            LM -- "Constraint /<br/>Substitution" --> SCI
        end

        subgraph JUDGE["Judge Engine"]
            direction LR
            RUBRIC["Rubric Scorer<br/><i>Deterministic</i>"]
            EXPLAIN["Explanation Layer<br/><i>Optional LLM</i>"]
            RUBRIC --> EXPLAIN
        end

        STATE --> AGENTS
        AGENTS -- "step()" --> STATE
        STATE -- "Episode End" --> JUDGE
    end

    subgraph REWARD["Reward Computation"]
        direction LR
        R["Rigor<br/>Score"]
        FE["Feasibility<br/>Score"]
        FI["Fidelity<br/>Score"]
        BONUS["Efficiency +<br/>Communication<br/>Bonus"]
        PEN["Penalties<br/><i>Timeout · Over Budget<br/>Missing Controls</i>"]
        TOTAL["<b>Total Reward</b><br/><i>10 × R × Fe × Fi<br/>+ Bonus − Penalties</i>"]
        R --> TOTAL
        FE --> TOTAL
        FI --> TOTAL
        BONUS --> TOTAL
        PEN --> TOTAL
    end

    subgraph TRAINING["RL Training Pipeline"]
        direction LR
        COLAB["Google Colab<br/><i>TRL / Unsloth · GRPO</i>"]
        ROLLOUT["Rollout Loop<br/><i>reset() → step() → reward</i>"]
        CURVES["Reward Curves<br/><i>Before vs After</i>"]
        COLAB --> ROLLOUT --> CURVES
    end

    subgraph SERVING["Deployment & Serving"]
        direction LR
        FASTAPI["FastAPI +<br/>WebSocket Server"]
        DOCKER["Docker Container"]
        HF["Hugging Face Space<br/><i>sdk: docker · port: 7860</i>"]
        FASTAPI --> DOCKER --> HF
    end

    subgraph UI["Frontend"]
        direction LR
        REACT["React + Vite UI"]
        FALLBACK["OpenEnv /web<br/><i>Fallback</i>"]
        subgraph PANELS["Layout"]
            direction TB
            LEFT["Left Panel<br/><i>Paper · Seed · Round</i>"]
            MID["Middle Panel<br/><i>Negotiation Log</i>"]
            RIGHT["Right Panel<br/><i>Protocol · Budget · Scores</i>"]
        end
        REACT --> PANELS
        FALLBACK --> PANELS
    end

    SE -- "reset(seed)" --> ENV
    JUDGE -- "Scores" --> REWARD
    TOTAL -- "Reward Signal" --> TRAINING
    ROLLOUT -- "Episodes" --> ENV
    ENV -- "API" --> FASTAPI
    FASTAPI -- "WebSocket" --> REACT
    FASTAPI -- "WebSocket" --> FALLBACK
    TRAINING -. "Updated Scientist<br/>Policy Weights" .-> SCI

    classDef scenario fill:#3b82f6,stroke:#1d4ed8,color:#fff
    classDef env fill:#1e293b,stroke:#475569,color:#e2e8f0
    classDef agent fill:#8b5cf6,stroke:#6d28d9,color:#fff
    classDef judge fill:#f59e0b,stroke:#d97706,color:#1e293b
    classDef reward fill:#10b981,stroke:#059669,color:#fff
    classDef training fill:#ef4444,stroke:#dc2626,color:#fff
    classDef serving fill:#6366f1,stroke:#4f46e5,color:#fff
    classDef ui fill:#ec4899,stroke:#db2777,color:#fff
    classDef panel fill:#fdf2f8,stroke:#ec4899,color:#1e293b

    class PT,CG,SE scenario
    class STATE env
    class SCI,LM agent
    class RUBRIC,EXPLAIN judge
    class R,FE,FI,BONUS,PEN,TOTAL reward
    class COLAB,ROLLOUT,CURVES training
    class FASTAPI,DOCKER,HF serving
    class REACT,FALLBACK ui
    class LEFT,MID,RIGHT panel
ReplicaLab_Architecture.svg
ADDED
Git LFS Details
ReplicaLab_Architecture_Final.svg
ADDED
Git LFS Details
ReplicaLab_Architecture_v2.svg
ADDED
Git LFS Details
ReplicaLab_Architecture_v2_polished.svg
ADDED
Git LFS Details
ReplicaLab_Blueprint.md
ADDED
|
@@ -0,0 +1,426 @@
# ReplicaLab

**A multi-agent scientific replication environment built on OpenEnv**

---

## Overview

ReplicaLab is a virtual scientific replication world. Each episode generates an original experiment and a constrained lab, then two agents negotiate a replication plan:

- A **Scientist** agent that protects scientific validity.
- A **Lab Manager** agent that protects cost, equipment, time, staffing, and feasibility.

They negotiate over multiple rounds. If they converge on a sound, feasible protocol, the episode yields a high reward. If they fail, overspend, or strip away critical scientific elements, the reward stays low.

The real-world motivation is the **replication crisis**: published protocols describe ideal conditions, but real labs face missing tools, tight budgets, booking conflicts, reagent shortages, and limited personnel. ReplicaLab trains an agent to answer a single question:

> *How do we adapt an experiment without breaking the science?*

---

## Hackathon Track Alignment

ReplicaLab touches four of the five OpenEnv Hackathon problem statements.

### Primary Tracks

| Track | Fit |
|---|---|
| **Multi-Agent Interactions** | Two roles hold different private information and must negotiate toward consensus. Strongest fit. |
| **World Modeling (Professional)** | The agent reasons inside a professional world with hidden constraints. Very strong fit. |

### Supporting Tracks

| Track | Fit |
|---|---|
| **Long-Horizon Planning** | The agent must ask, revise, recover, and converge across multiple rounds rather than solving in one step. |
| **Self-Improvement** | The same environment trains the Scientist so its behavior improves over repeated episodes. |

**Demo framing:** Lead with Multi-Agent + World Modeling. Support with Long-Horizon + Self-Improvement.

---

## Why This Is an Environment

ReplicaLab is not a prompt. It satisfies all five properties of a proper environment:

1. **State** — Current paper, lab constraints, round number, negotiation history, proposed protocol, spent budget, remaining stock, done flag.
2. **Actions** — The Scientist can propose, revise, ask questions, or accept. The Lab Manager can report feasibility, suggest substitutions, reject, or accept.
3. **Transitions** — Each action mutates the world: budget consumed, protocol updated, round counter incremented, dialogue history extended.
4. **Observations** — Each role sees a different partial view of the world (partially observable).
5. **Reward** — The environment scores the quality of the final plan.

OpenEnv provides exactly this pattern: typed `Action`, `Observation`, and `State` models with `reset()`, `step()`, and `state()` methods, wrapped in FastAPI + WebSocket serving with per-session instances.
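
A schematic of that pattern, with plain dataclasses standing in for the actual OpenEnv base models; class names and fields here are illustrative, not the library's API:

```python
# Schematic env skeleton (NOT the OpenEnv base classes): typed observation
# and state, plus reset/step/state methods.
from dataclasses import dataclass, field

@dataclass
class Observation:
    round: int
    history: list = field(default_factory=list)

@dataclass
class State:
    round: int = 0
    budget_spent: float = 0.0
    done: bool = False

class ReplicaLabEnv:
    def __init__(self, max_rounds=6):
        self.max_rounds = max_rounds
        self._state = State()

    def reset(self, seed=0):
        self._state = State()
        return Observation(round=0)

    def step(self, action):
        s = self._state
        s.round += 1
        s.done = action.get("action_type") == "accept" or s.round >= self.max_rounds
        reward = 1.0 if s.done else 0.0  # placeholder terminal reward
        return Observation(round=s.round), reward, s.done

    def state(self):
        return self._state
```

The real environment swaps the placeholder reward for the Judge's rubric score and fills the state with paper, constraints, and protocol.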
---

## Episode Lifecycle

A single episode unfolds as follows:

1. **Reset** — `reset(seed=42)` creates one paper template, one lab constraint set, and one hidden evaluation rubric.
2. **Scientist observes** — Paper summary, experiment goal, conversation history, current proposed protocol.
3. **Lab Manager observes** — Budget, equipment, booking calendar, reagents, staff, safety rules, current proposal.
4. **Scientist acts** — Proposes, revises, asks, or accepts.
5. **Lab Manager responds** — Reports feasibility, suggests substitutions, or accepts.
6. **State updates** — Environment transitions.
7. **Repeat** for a fixed number of rounds or until both sides accept (or timeout).
8. **Reward returned** — The environment scores the final protocol.

### Key Design Decision

For the MVP, only the **Scientist is trained**.

| Role | Implementation |
|---|---|
| **Scientist** | Trainable LLM policy |
| **Lab Manager** | Deterministic rule-based policy with readable responses |
| **Judge** | Deterministic rubric engine, with optional LLM explanation layer |

This gives stable environment dynamics and clean reward signals for a hackathon setting.

---

## The Three Roles

### A. Scientist Agent

The Scientist protects scientific quality. It reasons about essential controls, safe sample-size reductions, valid substitutions, and the minimum viable version of an experiment that still tests the claim.

**Action schema:**

```json
{
  "action_type": "propose_protocol | revise_protocol | request_info | accept",
  "sample_size": 60,
  "controls": ["vehicle_control", "positive_control"],
  "technique": "WST1",
  "duration_days": 7,
  "required_equipment": ["plate_reader", "incubator"],
  "required_reagents": ["drug_A", "WST1_kit"],
  "questions": ["Do we have a plate reader free this week?"],
  "rationale": "WST1 is an acceptable substitute for MTT in this template"
}
```

### B. Lab Manager Agent

The Lab Manager protects feasibility: budget, equipment availability, machine bookings, reagent delivery timelines, and staffing. For the MVP this is a rule-based system (deterministic constraint checker, substitution suggester, cost estimator, booking checker, natural-language response template) to keep environment behavior stable and debuggable.
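
A tiny sketch of one such deterministic check; the field names mirror the action schema above, while the helper name and lab structure are illustrative:

```python
# Sketch of a rule-based feasibility check. Field names mirror the action
# schema above; the lab dict and helper name are illustrative.
def check_feasibility(protocol, lab):
    """Return a list of human-readable problems; empty means feasible."""
    problems = []
    missing = set(protocol["required_equipment"]) - set(lab["equipment"])
    if missing:
        problems.append(f"missing equipment: {sorted(missing)}")
    cost = sum(lab["costs"].get(r, 0) for r in protocol["required_reagents"])
    if cost > lab["budget"]:
        problems.append(f"over budget: {cost} > {lab['budget']}")
    return problems

lab = {"equipment": ["incubator"], "budget": 500,
       "costs": {"drug_A": 300, "WST1_kit": 400}}
protocol = {"required_equipment": ["plate_reader", "incubator"],
            "required_reagents": ["drug_A", "WST1_kit"]}
print(check_feasibility(protocol, lab))
```

Because the checker is deterministic, the same proposal always gets the same objections, which keeps training stable and episodes replayable.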
### C. Judge

The Judge is a **rubric-backed scorer**, not a free-form LLM.

It receives the original paper, hidden minimum-viable replication spec, final proposed protocol, actual lab constraints, and negotiation transcript. It outputs:

- Rigor score
- Feasibility score
- Fidelity score
- Final reward
- Audit notes

An optional LLM explanation layer can translate the audit into readable notes for the UI.

---

## Reward Structure

### Core Dimensions

| Dimension | What It Measures | Examples |
|---|---|---|
| **Rigor** | Did the agent preserve the important science? | Sample size, controls, method validity, statistics, duration |
| **Feasibility** | Can this lab actually run the plan? | Budget, equipment availability, stock, timeline, staffing |
| **Fidelity** | How close is the plan to the original experiment? | Same technique or valid substitute, same control logic, similar sample size, same study aim |

### Formula

```
total_reward = 10 × rigor × feasibility × fidelity
             + efficiency_bonus
             + communication_bonus
             − penalties
```

The multiplicative core prevents fake wins: a scientifically perfect but impossible plan scores low, and a cheap but scientifically broken plan also scores low.

### Penalties

Applied for timeout, exceeding budget, invalid structure, missing critical controls, and bad substitutions.
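
The formula drops straight into code; the bonus and penalty magnitudes in the example call are illustrative:

```python
# The reward formula as a function. Bonus/penalty magnitudes are illustrative.
def total_reward(rigor, feasibility, fidelity,
                 efficiency_bonus=0.0, communication_bonus=0.0, penalties=0.0):
    # Multiplicative core: any dimension at 0 collapses the base reward.
    return (10 * rigor * feasibility * fidelity
            + efficiency_bonus + communication_bonus - penalties)

print(round(total_reward(0.9, 0.8, 0.85,
                         efficiency_bonus=0.5, penalties=1.0), 2))  # 5.62
print(total_reward(1.0, 0.0, 1.0))  # 0.0 — perfect science, impossible plan
```

The second call shows the anti-gaming property: feasibility of zero wipes out a perfect rigor/fidelity pair, so bonuses alone can never rescue a broken plan's core score.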
---

## Reinforcement Learning

RL improves the **Scientist policy**.

1. Environment resets.
2. Scientist generates an action.
3. Lab Manager replies.
4. Episode ends with a reward.
5. Training loop adjusts the Scientist toward higher-reward behaviors.

**Target behaviors over training:**

- Ask better questions before committing.
- Preserve critical controls.
- Choose realistic substitutions.
- Reach agreement faster.
- Avoid over-budget plans.

TRL supports OpenEnv-style training through a custom `rollout_func` for stepping through an environment with environment-computed rewards. GRPO supports custom reward functions. Unsloth provides GRPO notebooks designed for this kind of training.

---

## Self-Improvement

For the MVP, self-improvement means the Scientist gets measurably better through repeated episodes. That is sufficient for the track.

**Stretch goals (time permitting):**

- **Curriculum learning** — Easy scenarios first, then medium, then hard.
- **Self-critique** — After a failed episode, the agent reviews a short audit and retries.
- **Self-play** — Train both Scientist and Lab Manager.

---

## World Modeling and Long-Horizon Planning

### World Modeling

The agent must build an internal model of a hidden world: what the lab has, what it lacks, what is booked, what is scientifically critical, what is flexible, and how choices affect future feasibility. None of this is fully visible, so the agent infers the world through negotiation.

### Long-Horizon Planning

The best move is rarely the first move. A strong Scientist follows a chain: understand the paper goal, ask what is available, propose a first plan, revise after constraints surface, trade off cost against rigor, and reach agreement before timeout. That is multi-step planning, not a single answer.

---

## Constraint System

Constraints come from a **scenario generator**. Each scenario template defines required equipment, optional substitutes, must-keep controls, minimum sample size, minimum duration, typical costs, and likely bottlenecks. Difficulty modifies them:

| Difficulty | Description |
|---|---|
| **Easy** | Lab has most of what is needed. |
| **Medium** | Some missing items, tighter budget, tighter time. |
| **Hard** | Major shortages, bigger tradeoffs, booking conflicts. |

For the MVP, the world is **deterministic within each episode**: the initial seed defines the entire scenario, resources change only through agent choices, and there are no random surprise events. This makes debugging, replay, and demo presentations much stronger.
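
A sketch of what seed-determinism means in code: every random choice flows through one seeded RNG, so the same seed replays the same world. The scenario fields are illustrative:

```python
import random

# Sketch of seed-deterministic scenario generation. Fields are illustrative.
def generate_scenario(seed, difficulty="medium"):
    rng = random.Random(seed)  # all randomness flows through this one RNG
    return {
        "seed": seed,
        "difficulty": difficulty,
        "budget": rng.randrange(500, 5001, 100),
        "missing_equipment": rng.sample(
            ["plate_reader", "centrifuge", "gpu_a100", "incubator"], k=2),
    }

assert generate_scenario(42) == generate_scenario(42)  # same seed, same world
assert generate_scenario(42) != generate_scenario(43)  # different worlds
```

The key discipline is never touching the module-level `random` functions: one stray `random.choice()` breaks replay.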
|
| 211 |
+
|
| 212 |
+
---
|
| 213 |
+
|
| 214 |
+
## Interface Design
|
| 215 |
+
|
| 216 |
+
### Layout
|
| 217 |
+
|
| 218 |
+
| Section | Content |
|
| 219 |
+
|---|---|
|
| 220 |
+
| **Left Panel** | Original paper summary, challenge label, seed, round counter |
|
| 221 |
+
| **Middle Panel** | Negotiation log (Scientist in blue, Lab Manager in green, Judge audit at end) |
|
| 222 |
+
| **Right Panel** | Current proposed protocol, lab inventory snapshot, budget bar, score bars for rigor/feasibility/fidelity |
|
| 223 |
+
| **Bottom Controls** | New episode, seed selector, scenario selector, replay slider, before-vs-after training toggle |
|
| 224 |
+
|
| 225 |
+
### Implementation
|
| 226 |
+
|
| 227 |
+
- **Demo UI:** Custom React + Vite app hitting the FastAPI + WebSocket backend.
|
| 228 |
+
- **Fallback UI:** OpenEnv built-in `/web` interface.
|
| 229 |
+
|
| 230 |
+
---
|
| 231 |
+
|
| 232 |
+
## Folder Structure
|
| 233 |
+
|
| 234 |
+
```
replicalab/
├── README.md
├── pyproject.toml
├── openenv.yaml
├── .dockerignore
├── replicalab/
│   ├── __init__.py
│   ├── models.py
│   ├── client.py
│   ├── prompts/
│   │   ├── scientist.txt
│   │   ├── lab_manager.txt
│   │   └── judge.txt
│   ├── scenarios/
│   │   ├── templates.py
│   │   ├── cell_biology.py
│   │   ├── ml_benchmark.py
│   │   └── behavioral_psych.py
│   ├── scoring/
│   │   ├── rubric.py
│   │   ├── rigor.py
│   │   ├── feasibility.py
│   │   └── fidelity.py
│   ├── agents/
│   │   ├── scientist_policy.py
│   │   ├── lab_manager_policy.py
│   │   └── judge_policy.py
│   ├── env/
│   │   └── replicalab_env.py
│   └── utils/
│       ├── seed.py
│       ├── validation.py
│       └── logging.py
├── server/
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── package.json
│   ├── vite.config.ts
│   └── src/
│       ├── App.tsx
│       ├── components/
│       └── pages/
├── notebooks/
│   └── train_colab.ipynb
└── tests/
    ├── test_env.py
    ├── test_reward.py
    ├── test_scenarios.py
    └── test_server.py
```

---

## Toolchain

| Tool | Purpose |
|---|---|
| **OpenEnv 0.2.1** | Environment class and server |
| **Hugging Face Spaces** | Public hosting (Docker SDK, port 7860) |
| **Docker** | Packaging server + frontend |
| **Google Colab** | Required training notebook |
| **TRL / Unsloth** | RL training on the Scientist |
| **FastAPI + WebSocket** | Live environment serving |
| **React + Vite** | Frontend |
| **Tailwind + shadcn/ui** | Styling |
| **Matplotlib** | Reward curves in Colab |
| **CSV / JSONL logs** | Replay and debugging |

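
As a sketch of what the replay-friendly JSONL logs could look like, one append-only record per negotiation round (the record fields here are hypothetical, not the project's frozen schema):

```python
import json

def append_round(path: str, episode_id: str, round_no: int,
                 speaker: str, message: str, reward=None) -> None:
    # One JSON object per line makes the log trivially streamable
    # for the replay slider and for debugging.
    record = {
        "episode": episode_id,
        "round": round_no,
        "speaker": speaker,
        "message": message,
        "reward": reward,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Reading it back for replay is then just `json.loads` per line, in order.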
---

## Scope

### In Scope (MVP)

1. One working OpenEnv environment
2. Three scenario templates (Cell Biology, ML Benchmark, Behavioral Psychology)
3. Trainable Scientist agent
4. Rule-based Lab Manager
5. Judge rubric engine
6. Reward logging
7. HF Space deployment
8. Colab RL notebook with reward curve
9. Public repo
10. One-minute YouTube demo
11. Clean README
12. React UI or polished `/web` fallback

### Stretch (Only If Ahead)

- LLM Lab Manager
- Live replay mode
- Side-by-side before-vs-after comparison
- More scenario families
- Judge explanation LLM
- Curriculum learning

### Out of Scope

- Proving a real paper is true or false
- Parsing arbitrary papers from the internet
- Full autonomous lab automation
- Real wet-lab execution
- Full multi-model self-play
- Enterprise workflow integrations

---

## Team Roles (4 People)

| Person | Ownership |
|---|---|
| **P1: Environment + Reward** | Scenario engine, environment state, constraint logic, reward logic, tests |
| **P2: RL + Model** | Scientist policy prompt, TRL/Unsloth notebook, rollout loop, reward curves, before/after evaluation |
| **P3: Backend + Deploy** | FastAPI, WebSocket, Docker, HF Space, logging, replay API |
| **P4: Frontend + Story** | React/Vite UI, visualization, demo flow, README, YouTube demo |

Everyone shares bug fixing, testing, and final polish.

---

## Build Sequence

1. Freeze the environment schema
2. Implement one scenario end to end
3. Add reward and logs
4. Add rule-based Lab Manager
5. Add Scientist baseline
6. Connect Colab training
7. Add React UI
8. Deploy to HF
9. Record demo
10. Write README

---

## Judging Criteria and Demo Strategy

| Criterion (Weight) | How ReplicaLab Scores |
|---|---|
| **Environment Innovation (40%)** | Partially observable, multi-role scientific negotiation world, not a toy chat task. |
| **Storytelling (30%)** | Scientist vs. Lab Manager is instantly understandable. |
| **Training Improvement (20%)** | Same seed, before training vs. after training, visible reward improvement. |
| **Pipeline Setup (10%)** | Clean reward formula, structured logs, reproducible Colab notebook. |

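
The "clean reward formula" could be sketched as a weighted rubric over the three score dimensions; the weights and penalty below are illustrative assumptions, not the project's tuned values:

```python
# Illustrative weights over the three rubric dimensions.
WEIGHTS = {"rigor": 0.4, "feasibility": 0.35, "fidelity": 0.25}

def compute_total_reward(scores: dict, invalid_actions: int = 0) -> float:
    # Each component score is assumed to lie in [0, 1]; invalid actions
    # are penalized so the policy learns to emit schema-valid moves.
    base = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    return base - 0.05 * invalid_actions
```

Keeping the formula this small is what makes the before-vs-after comparison legible to judges: every reward change traces to one of three named scores or the invalid-action count.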
### Demo Flow

1. New episode with a specific seed.
2. Paper appears, Scientist proposes.
3. Lab Manager pushes back.
4. Negotiation unfolds over rounds.
5. Judge shows final scores.
6. Replay same seed with the trained model.
7. Trained model asks smarter questions, avoids bad substitutions, earns higher reward.

---

## Success Metrics

| Metric | Untrained Scientist | Trained Scientist |
|---|---|---|
| Average reward | Lower | Higher |
| Rounds to agreement | More | Fewer |
| Invalid action rate | Higher | Lower |
| Agreement rate | Lower | Higher |

---

## Sponsor Alignment

| Target | Rationale |
|---|---|
| **Halluminate** | True multi-actor environment with different beliefs and information per role. |
| **Snorkel AI** | Simulated experts in the loop; the Scientist learns by interacting with expert-style roles. |
| **Fleet AI** (alternate) | Judge as an explicit oversight layer monitoring and explaining the two agents. |

---

## Real-World Applications

**Target users:** Biotech teams, pharma R&D groups, contract research organizations, university labs, cloud lab platforms, AI labs training scientific agents.

**Potential revenue paths:** Enterprise experiment planning software, evaluation benchmark licensing, simulation API access, experiment design copilot products.

---

## The Simple Explanation

Imagine two kids want to bake a cake. One knows the **recipe**. The other knows what is in the **kitchen**. The recipe kid says they need eggs, milk, flour, and chocolate. The kitchen kid says there is no chocolate, but there is cocoa. They talk and make the best cake they can. If the cake stays tasty, uses what the kitchen has, and finishes on time, they earn a star.
ReplicaLab is that, but for science.

ReplicaLab_Comprehensive_Task_Division.md
ADDED
@@ -0,0 +1,996 @@

# ReplicaLab Comprehensive Task Division and Delivery Backlog

## 1. Document purpose

This document is the working blueprint for building **ReplicaLab** in a hackathon setting with a **four-person team**. It is written like a lightweight real-world delivery plan with:

1. Product scope
2. Team ownership
3. Module and function ownership
4. Epics
5. User stories
6. Lowest-level tasks
7. Dependencies
8. Acceptance criteria
9. Delivery workflow
10. Definition of done

The goal is to let any team member pick up work immediately without confusion.

---

## 2. Product summary

**ReplicaLab** is an OpenEnv environment where a **Scientist agent** and a **Lab Manager agent** negotiate how to solve a constrained technical task under real-world limits such as budget, tools, compute, schedule, stock, and staffing.

The environment is used to **train the Scientist agent with reinforcement learning** so it learns to ask better questions, preserve objective quality, use bounded evidence tools correctly, and produce more feasible plans under domain-specific constraints.

The first domain focus is:

1. Mathematics
2. Machine learning
3. Finance and trading design in offline or backtest form

Physics and biology remain follow-on adapters once the normalized scenario layer is stable.

### The judged MVP outcome

By judging time, the project should demonstrate:

1. A working OpenEnv environment deployed on Hugging Face Spaces on port `7860`
2. At least one full scenario family working end to end, with a target of three
3. A Scientist agent that can interact with the environment through structured actions and bounded evidence tools
4. A hybrid model-backed Lab Manager with deterministic feasibility grounding and bounded validation tools
5. A deterministic judge and reward engine
6. A Colab training notebook plus a reusable H100 job path using Unsloth or HF TRL
7. A reward curve showing improvement
8. A public GitHub repository
9. A one-minute YouTube demo
10. A README with architecture, setup, and results

---

## 3. Scope control

## 3.1 In scope for the hackathon MVP

1. OpenEnv environment implementation
2. FastAPI and WebSocket serving
3. Hugging Face Docker Space deployment
4. Scientist agent with structured JSON action output plus bounded search, code-check, and image-inspection capability
5. Hybrid model-backed Lab Manager grounded by deterministic feasibility checks plus bounded validation tools
6. Judge rubric engine with deterministic scoring
7. Three scenario families for the MVP
   1. Mathematics reasoning and proof planning
   2. ML benchmark replication
   3. Finance or trading backtest planning
8. Frozen evidence packs for deterministic training plus limited live validation during demo or eval
9. Reward logging
10. Replay logs
11. Colab RL notebook
12. Reward curve image
13. Thin React plus Vite frontend or OpenEnv `/web` fallback
14. README, demo video, submission package

## 3.2 Out of scope for the hackathon MVP

1. Proving whether a real research paper is globally true or false
2. Unrestricted parsing of arbitrary live internet content inside the training loop
3. Real wet-lab execution
4. Live trading or production finance execution
5. Real-time collaboration features
6. Training both Scientist and Lab Manager in self-play
7. Open-ended autonomous coding outside a bounded verification or analysis sandbox
8. Image generation or audio capabilities in the agent policy loop
9. Complex third-party enterprise integrations
10. Full multi-domain rollout unless time remains
11. Manager-led subagent orchestration unless the MVP is already stable

---

## 4. Team structure and role ownership

| Role | Owner focus | Primary responsibilities | Secondary responsibilities |
| --- | --- | --- | --- |
| Person A | Environment and Scoring Lead | scenario engine, constraint logic, reward logic, state transitions, tests | supports judge audit text |
| Person B | RL and Agent Lead | Scientist prompting, action schemas, training loop, rollouts, evaluation, reward curves | supports Lab Manager templating |
| Person C | Backend and Infra Lead | FastAPI server, WebSocket handling, Docker, HF Space deploy, logs, replay endpoints | supports local dev scripts |
| Person D | Frontend and Storytelling Lead | React plus Vite UI, live negotiation display, replay viewer, README, demo flow, video assets | supports final integration testing |

### Shared responsibilities

| Shared area | Expectation |
| --- | --- |
| Git hygiene | every feature goes through branch plus PR |
| Integration | merge to main only after a quick smoke test |
| Testing | each owner writes tests for their workstream |
| Storytelling | everyone contributes screenshots, GIFs, examples |
| Submission readiness | all four review the final demo, notebook, README, repo visibility |

## 4.1 Training compute and model selection

1. The team has access to an H100 GPU for heavier Scientist and Lab Manager training and evaluation runs.
2. Person B is the primary owner of that compute for RL tasks, especially `TRN 04` to `TRN 10`, `TRN 13` to `TRN 15`, `OBS 06`, and `TST 09`.
3. The judged artifact remains the Colab notebook, but the primary heavy-runtime path is now a Northflank H100 GPU job with persistent-volume checkpoints and caches.
4. Person C supports any environment URL, secret, volume, or infra setup needed so the H100 training run can connect to the same backend contract as the notebook.

### Trainable model

The primary shared base model for the current training iteration is **Qwen3.5-9B**.

| Model | Role | Rationale |
| --- | --- | --- |
| Qwen3.5-9B | Primary shared base for Scientist and Lab Manager adapters | Fits the Northflank H100 plan, upgrades the repo from the older Qwen3-8B baseline, and keeps both trainable role artifacts on one model family. |
| Qwen3.5-4B | Reduced-scale fallback | Use for Colab or lower-memory debug runs when faster iteration matters more than final V2 quality. |
| Qwen3.5-122B-A10B | Audit-only judge candidate | Useful for qualitative post-run analysis, but not part of the deterministic training reward loop. |

### Evaluator layer

The training reward is always the **deterministic rubric engine** defined in E05. Anthropic is the active hosted oracle provider for post-episode explanation, scenario enrichment, and demo audit only. The frontier evaluator is never part of the training reward loop.

### MVP role implementations

| Role | MVP implementation | Future stretch |
| --- | --- | --- |
| Scientist | Trainable GRPO policy (`Qwen3.5-9B` + LoRA) | Larger-model distillation or curriculum extensions |
| Lab Manager | Deterministically grounded role with a trainable SFT adapter on `Qwen3.5-9B` | Manager orchestrator with specialist subagents and richer role-specific adapters |
| Judge (training reward) | Deterministic rubric engine | Unchanged |
| Judge (explanation layer) | Optional large-model audit layer such as `Qwen3.5-122B-A10B` or Anthropic | Extended explanation panel in UI |

## 4.2 Domain roadmap and normalized scenario layer

The frozen outer action and observation contract from `FND 08`, `MOD 01`, `MOD 02`, and `MOD 03` remains stable. Domain expansion happens below that contract through a normalized scenario layer.

The internal data flow is:

`scenario adapter -> normalized scenario pack -> observation mapper -> ScientistObservation or LabManagerObservation`

Every scenario family must emit the same normalized scenario pack with, at minimum:

1. `domain_id`
2. `task_summary`
3. `success_criteria`
4. `constraints`
5. `resources`
6. `allowed_substitutions`
7. `hidden_reference_spec`
8. `scenario_id`
9. `seed`

Rules for the normalized scenario layer:

1. Domain-specific logic belongs in thin adapters, not in prompts or reward code.
2. Prompts must be assembled from the normalized scenario pack, not hard-coded to one domain.
3. Difficulty and curriculum changes should mechanically alter constraints, resources, or conflicts rather than fork separate prompt logic.
4. The deterministic scorer compares the final agreed plan against `hidden_reference_spec`; model-backed roles never own truth.

For the bounded-tool MVP, pending scenario and environment work will extend the existing normalized scenario pack with additive evidence fields. This is an extension below the frozen outer contract, not a reopening of `FND 08`, `MOD 01`, `MOD 02`, or `MOD 03`.

Tool-capable scenario extensions:

1. `evidence_pack`
2. `artifact_refs`
3. `allowed_tools`
4. `tool_budget`
5. `validation_policy`

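
One possible shape for the normalized scenario pack, including the additive tool-era fields, is sketched below. This is illustrative only; the authoritative contract lives in `replicalab/models.py` under `FND 08`:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ScenarioPack:
    # Minimum fields every scenario family must emit.
    domain_id: str
    task_summary: str
    success_criteria: list
    constraints: dict
    resources: dict
    allowed_substitutions: dict
    hidden_reference_spec: dict  # read only by the deterministic scorer
    scenario_id: str
    seed: int
    # Additive tool-era fields (extension below the frozen outer contract).
    evidence_pack: dict = field(default_factory=dict)
    artifact_refs: tuple = ()
    allowed_tools: tuple = ()
    tool_budget: int = 0
    validation_policy: str = "frozen"
```

Defaulting the tool-era fields means existing adapters keep working unchanged, which is what makes the extension additive rather than a contract break.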
## 4.3 Bounded tool capability policy

The richer-capability MVP keeps the final outward action contract stable while adding bounded tools below it.

### Scientist allowed capabilities

1. `search_evidence`
   - retrieve supporting facts, benchmark rules, paper details, or official references
   - not a reward source
2. `run_code_check`
   - bounded code or config analysis, metric checks, value generation, runtime or cost estimation
3. `inspect_image`
   - read tables, plots, figures, screenshots, and charts for evidence extraction

### Lab Manager allowed capabilities

1. `search_resources`
   - retrieve resource, policy, benchmark, or documentation constraints
2. `run_code_check`
   - validate cost, runtime, config, reproducibility, or execution assumptions
3. `inspect_image`
   - inspect figures, charts, and screenshots relevant to feasibility or policy review

### Judge capability rules

1. The judge reward remains deterministic and must not depend on live web search.
2. Tool traces and evidence references may inform deterministic penalties, bonuses, or audit text.
3. The judge may use bounded evidence verification for demo or audit text, but never as the training reward source.

### Training and demo rules

1. Training uses frozen evidence packs and deterministic tool traces whenever possible.
2. Live web search is limited to demo-time or eval-time validation, not the core training reward loop.
3. Image generation and audio are excluded from the policy loop for the hackathon MVP.
4. Coding capability must stay sandboxed and task-scoped rather than open-ended.

---

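
A minimal sketch of how a tool call could be gated by `allowed_tools` and `tool_budget`, answering from a frozen evidence pack during training; all names other than those two fields are hypothetical:

```python
class ToolBudgetExceeded(Exception):
    """Raised when an episode has spent its tool_budget."""

def dispatch_tool(state: dict, tool: str, query: str,
                  allowed_tools: set, tool_budget: int):
    # Reject tools outside the scenario's allow-list.
    if tool not in allowed_tools:
        raise ValueError(f"{tool!r} is not in allowed_tools")
    # Enforce the per-episode call budget.
    if state["tool_calls"] >= tool_budget:
        raise ToolBudgetExceeded(f"tool_budget of {tool_budget} exhausted")
    state["tool_calls"] += 1
    # During training, answers come from the frozen evidence pack,
    # never from the live web, so rollouts stay deterministic.
    return state["evidence_pack"].get((tool, query), "no frozen evidence")
```

Raising on budget exhaustion (rather than silently returning nothing) gives the reward engine a clean signal to penalize wasteful tool use.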
## 5. Module and function ownership map

| Module or file | Key functions or classes | Owner | Notes |
| --- | --- | --- | --- |
| `replicalab/models.py` | `ScientistAction`, `LabManagerAction`, `Observation`, `StepResult`, `EpisodeState`, `EpisodeLog` | Person A with Person B | shared contract file |
| `replicalab/scenarios/templates.py` | `generate_scenario()`, `load_template()`, `apply_difficulty()`, `seed_rng()` | Person A | central normalized scenario factory and mapper inputs |
| `replicalab/scenarios/math_reasoning.py` | `build_math_reasoning_template()` | Person A | first structured reasoning scenario |
| `replicalab/scenarios/ml_benchmark.py` | `build_ml_benchmark_template()` | Person A | first reproducible compute scenario |
| `replicalab/scenarios/finance_trading.py` | `build_finance_trading_template()` | Person A | offline strategy and backtest planning only |
| `replicalab/agents/scientist_policy.py` | `build_scientist_prompt()`, `parse_scientist_output()` | Person B | trainable role |
| `replicalab/agents/lab_manager_policy.py` | `generate_lab_manager_response()`, `check_feasibility()` | Person B with Person A | model-backed negotiation grounded by deterministic checker |
| `replicalab/agents/judge_policy.py` | `explain_judgement()` (optional only) | Person A | explanation layer only |
| `replicalab/tools/search.py` | `search_evidence()`, `search_resources()` | Person B with Person C | bounded retrieval and validation only |
| `replicalab/tools/code_tools.py` | `run_code_check()` | Person B | bounded code analysis, config checks, and derived-value generation |
| `replicalab/tools/image_tools.py` | `inspect_image()` | Person B with Person D | bounded table, chart, figure, and screenshot inspection |
| `replicalab/scoring/rigor.py` | `score_rigor()` | Person A | deterministic |
| `replicalab/scoring/feasibility.py` | `score_feasibility()` | Person A | deterministic |
| `replicalab/scoring/fidelity.py` | `score_fidelity()` | Person A | deterministic |
| `replicalab/scoring/rubric.py` | `compute_total_reward()`, `build_reward_breakdown()` | Person A | core reward |
| `replicalab/utils/validation.py` | `validate_protocol()`, `validate_vocab()` | Person A | schema and semantic checks |
| `replicalab/utils/logging.py` | `write_episode_log()`, `write_reward_csv()` | Person C | logging helpers |
| `replicalab/env/replicalab_env.py` | `ReplicaLabEnv.reset()`, `step()`, `state()`, `close()` | Person A | OpenEnv environment |
| `server/app.py` | `create_app()`, REST routes, WebSocket handler | Person C | runtime entrypoint |
| `server/Dockerfile` | build and run app | Person C | deployment |
| `frontend/src/App.tsx` | app shell | Person D | UI root |
| `frontend/src/components/*` | paper panel, log panel, score panel, controls, replay, judge audit | Person D | UI components |
| `frontend/vite.config.ts` | dev proxy and build output config | Person C with Person D | frontend and backend integration |
| `frontend/tailwind.config.ts` and `frontend/postcss.config.js` | theme tokens and CSS pipeline | Person D | matches declared styling stack |
| `notebooks/train_colab.ipynb` | setup, connect, rollout, train, plot | Person B | judged asset |
| `replicalab/training/*.py` | reusable dataset, GRPO, SFT, evaluation, plotting, and job-entrypoint helpers | Person B | shared by notebook, Northflank H100 jobs, and evaluation scripts |
| `tests/*` | unit and integration tests | all | each owner covers own modules |
| `openenv.yaml` | environment registration and server config | Person A | required for OpenEnv discovery |
| `replicalab/config.py` | `MAX_ROUNDS`, `DEFAULT_DIFFICULTY`, `TIMEOUT_SECONDS`, `MAX_BUDGET` | Person A | single source of truth for constants |
| `replicalab/client.py` | `ReplicaLabClient.connect()`, `reset()`, `step()`, `close()` | Person B | reusable by notebook and external consumers |
| `replicalab/utils/seed.py` | `seed_rng()`, `get_deterministic_seed()` | Person A | shared by scenarios and env |
| `replicalab/prompts/*.txt` | role prompt templates | Person B | loadable domain-neutral text files assembled from normalized scenario data |
| `replicalab/outputs/` | `logs/`, `replays/`, `plots/` | Person C | gitignored output directories |
| `server/requirements.txt` | pinned runtime dependencies | Person C | standalone server install |
| `README.md` | project story, setup, results | Person D with all | judged asset |

---

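
As a usage sketch of the `ReplicaLabClient` contract listed in the ownership map (`connect()`, `reset()`, `step()`, `close()`), with stub bodies standing in for the real HTTP and WebSocket calls in `replicalab/client.py`:

```python
class ReplicaLabClient:
    """Illustrative stub of the client contract; method bodies are placeholders."""

    def __init__(self, base_url: str):
        self.base_url = base_url
        self.connected = False

    def connect(self):
        self.connected = True  # real impl would open the HTTP/WebSocket session
        return self

    def reset(self, seed: int) -> dict:
        # real impl would call the server's reset route with the seed
        return {"seed": seed, "round": 0}

    def step(self, action: dict) -> dict:
        # real impl would post the structured action and return the StepResult
        return {"reward": 0.0, "done": False}

    def close(self):
        self.connected = False

# Typical rollout loop shape for the Colab notebook or H100 job:
client = ReplicaLabClient("http://localhost:7860").connect()
obs = client.reset(seed=42)
result = client.step({"type": "propose", "protocol": {}})
client.close()
```

Keeping the notebook and the Northflank job on this one client keeps both training paths on the same backend contract.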
## 6. Delivery phases

| Phase | Goal | Exit condition |
| --- | --- | --- |
| Phase 0 | contracts and scaffolding | repo, schema, branch rules, basic app skeleton |
| Phase 1 | one working scenario end to end | reset, step, reward, logs work locally |
| Phase 2 | deployable environment | FastAPI, Docker, HF Space live |
| Phase 3 | trainable loop | Colab notebook connects and shows non-flat rewards |
| Phase 4 | compelling demo | UI, replay, reward breakdown, README, video |
| Phase 5 | hardening | smoke tests, bug fixes, final submission review |

---

## 7. Operating workflow
|
| 278 |
+
|
| 279 |
+
## 7.1 Branching model
|
| 280 |
+
|
| 281 |
+
| Branch type | Example | Rule |
|
| 282 |
+
| --- | --- | --- |
|
| 283 |
+
| main | `main` | always demo safe |
|
| 284 |
+
| feature | `feature/env-reset-loop` | one feature per branch |
|
| 285 |
+
| hotfix | `hotfix/ws-timeout-fix` | used only for urgent breaks |
|
| 286 |
+
|
| 287 |
+
## 7.2 PR checklist

Every PR must include:

1. linked task ID
2. summary of change
3. screenshots or logs if UI or environment behavior changed
4. quick test result
5. note on any schema or API changes

## 7.3 Integration cadence

1. Sync at the start of each day
2. Merge every 2 to 3 hours if stable
3. End-of-block smoke test on:
   1. local reset
   2. one full episode
   3. frontend load
   4. notebook connection if applicable

---

## 8. Epic backlog

### Status legend

- `✅ Completed`
- `❌ Failed`
- `🟡 Partial`
- `⬜ Not started`
- `Completed by`: fill this only when the finisher is different from the assigned owner; otherwise use `—`

---

## Epic E01. Foundations and repository setup

### Epic goal

Create a stable shared codebase, contracts, and development workflow so all workstreams can proceed in parallel.

### Current status

- `FND 01` status: completed on 2026-03-07
- `FND 01` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
- `FND 02` status: completed on 2026-03-08
- `FND 02` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
- `FND 04` status: completed on 2026-03-08
- `FND 04` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
- `FND 05` status: completed on 2026-03-08
- `FND 05` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
- `FND 06` status: completed on 2026-03-08
- `FND 06` completed by: `Person B (Ayush)` while the assigned owner remains `Person D`
- `FND 07` status: completed on 2026-03-08
- `FND 07` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
- `FND 08` status: completed on 2026-03-08
- `FND 08` completed by: `Person A (Kian)` and `Person B (Ayush)` with shared sign-off recorded in `docs/fnd08_frozen_json_contract.md`
- `FND 09` status: completed on 2026-03-08
- `FND 09` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
- `FND 11` status: completed on 2026-03-08
- `FND 11` completed by: `Max (Person C)`; the branch import and standards validation were handled by `Person B (Ayush)`
- `FND 10` status: completed on 2026-03-07
- `FND 10` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
- Completed scope for `FND 01`: created the agreed repo scaffold for `replicalab/`, `server/`, `frontend/`, `notebooks/`, and `tests/`, including the initial `replicalab/*` and `frontend/src/*` subfolders from the planned layout
- Completed scope for `FND 02`: added `pyproject.toml` with package metadata, Python version floor, runtime dependencies, dev extras, and basic pytest discovery settings; verified editable install and shared-model imports
- Completed scope for `FND 04`: added importable empty Pydantic model stubs in `replicalab/models.py` for the shared action, observation, step, state, and log contracts
- Completed scope for `FND 05`: created `.dockerignore` and expanded `.gitignore` to cover Python, Node, notebook, coverage, cache, and generated output artifacts while preserving tracked `.gitkeep` scaffold files
- Completed scope for `FND 06`: replaced the aspirational README with a temporary foundation stub that reflects the actual repo state, mission, team ownership, and current local setup placeholder
- Completed scope for `FND 07`: added GitHub PR and task-issue templates and tightened the repo workflow rules for branch naming and required tracking-doc updates
- Completed scope for `FND 08`: added `docs/fnd08_frozen_json_contract.md` with field semantics, enums, nested object schemas, null-vs-empty rules, canonical JSON examples for all 8 shared models, and final shared sign-off
- Completed scope for `FND 09`: added `openenv.yaml` with OpenEnv manifest metadata plus the minimal repo wiring required for local OpenEnv validation (`openenv-core` dependency, `server` script entry point, `uv.lock`, and `server.app.main()`)
- Completed scope for `FND 10`: created `replicalab/outputs/` with tracked `logs/`, `replays/`, and `plots/` subdirectories
- Completed scope for `FND 11`: added `server/requirements.txt` with standalone runtime dependency pins and verified installation from that file
- Completed scope for `FND 03`: imported the full React plus Vite frontend tree from Kush's branch onto `ayush`, including the app shell, pages, shared components, assets, and TypeScript config, and validated it with `npm --prefix frontend install` plus `npm --prefix frontend run build`
- Completed scope for `FND 12`: imported `frontend/vite.config.ts` with local `/api` and `/ws` proxy support plus stable Vite build settings and validated the build on `ayush`
- Backend and deployment scope imported from Max's PR has now been normalized onto the current standards, validated against the real env, Docker-verified locally, and extended with HF Spaces metadata plus deployment instructions
- Newly unblocked by `FND 08`: `MOD 01`, `MOD 02`, `MOD 03`, `MOD 12`, `SCN 01`
- Newly unblocked by `FND 06`: `DOC 01`
- Newly unblocked by `FND 03`: `FND 13`, `UI 01`
- Remaining Epic E01 work still gated by follow-on dependencies: `FND 13`
- Remaining completion items for the backend and deployment path: live HF Space bring-up (`API 10`), secrets documentation (`API 17`), replay persistence, and the remaining partial API polish tasks
- Completed scope for `SCN 01` to `SCN 10`: added deterministic seed utilities, normalized scenario-pack models, math / ML / finance template builders, difficulty scaling, hidden reference specs, allowed substitutions, and seeded scenario tests
- Completed scope for `SCN 11`: added three fixed golden scenarios for deterministic prompt and manual checks under `tests/fixtures/golden_scenarios.json`
- Completed scope for `AGT 01`: added a domain-neutral Scientist system prompt builder that renders role instructions, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON output contract from normalized scenario data
- Newly unblocked by `SCN 11` and `AGT 01`: `AGT 02`, `AGT 11`, `TRN 04`, `TRN 08`
- Remaining Epic E03 work after the scenario bundle: `SCN 12`

### User stories

**US E01.1**
As a developer, I want a clean repo and file layout so I can build without stepping on other people’s work.

**US E01.2**
As a team, we want agreed schemas and coding rules so integration risk stays low.

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FND 01 | E01.1 | Person C | repo root | Create repo structure and base folders from agreed layout | none | 0.5h | all top level folders exist and repo clones cleanly | ✅ Completed | Person B (Ayush) |
| FND 02 | E01.1 | Person C | `pyproject.toml` | Add Python project config and dependencies placeholder | FND 01 | 0.5h | project installs locally without missing package errors for base modules | ✅ Completed | Person B (Ayush) |
| FND 03 | E01.1 | Person C | `frontend/package.json` | Initialize React plus Vite frontend shell | FND 01 | 0.5h | `npm install` and dev server run successfully | ✅ Completed | Kush |
| FND 04 | E01.2 | Person A | `replicalab/models.py` | Add empty Pydantic models and shared type names | FND 01 | 0.5h | import paths resolve for all placeholder models | ✅ Completed | Person B (Ayush) |
| FND 05 | E01.2 | Person C | `.gitignore` and `.dockerignore` | Add ignore rules for Python, Node, logs, notebooks, and build artifacts. `.dockerignore` must explicitly exclude `.git`, `node_modules`, `notebooks/`, `tests/`, `__pycache__`, `.venv`, and output files to keep the Docker image lean | FND 01 | 0.25h | repo status stays clean after local run and build, and Docker build excludes non-runtime files | ✅ Completed | Person B (Ayush) |
| FND 06 | E01.2 | Person D | `README.md` | Add temporary project stub with title, mission, team roles, and local setup placeholder | FND 01 | 0.5h | new contributor can understand repo purpose in under two minutes | ✅ Completed | Person B (Ayush) |
| FND 07 | E01.2 | Person C | repo settings | Define branch naming, PR template, and issue template | FND 01 | 0.5h | all future PRs auto show the template and issue fields | ✅ Completed | Person B (Ayush) |
| FND 08 | E01.2 | Person A and B | docs or backlog file | Freeze JSON contract for actions and observations | FND 04 | 0.75h | all owners sign off and no blocking contract ambiguity remains | ✅ Completed | Person A (Kian) and Person B (Ayush) |
| FND 09 | E01.2 | Person A | `openenv.yaml` | Create OpenEnv configuration file specifying environment class, action and observation types, and server settings | FND 04 | 0.5h | OpenEnv can discover and serve the environment using this config file | ✅ Completed | Person B (Ayush) |
| FND 10 | E01.1 | Person C | `replicalab/outputs/` | Create output directory structure with `logs/`, `replays/`, and `plots/` subdirectories and add to gitignore | FND 01 | 0.25h | output directories exist and generated files are not committed to git | ✅ Completed | Person B (Ayush) |
| FND 11 | E01.1 | Person C | `server/requirements.txt` | Create server requirements file pinning FastAPI, uvicorn, websockets, and other runtime dependencies | FND 02 | 0.25h | server can be installed from requirements.txt independently of pyproject.toml | ✅ Completed | Max (Person C) |
| FND 12 | E01.1 | Person C | `frontend/vite.config.ts` | Create Vite config with API and WebSocket proxy support for local development plus stable build output settings | FND 03 | 0.5h | frontend dev server can reach backend without manual URL edits and build output is predictable for Docker packaging | ✅ Completed | Kush |
| FND 13 | E01.1 | Person D | `frontend/tailwind.config.ts` and `frontend/postcss.config.js` | Install and configure Tailwind plus shadcn base setup, theme tokens, and global styles | FND 03 | 0.75h | frontend can use Tailwind utilities and shared shadcn compatible theme tokens without CSS pipeline errors | ✅ Completed | Kush (Tailwind v4.2 with @theme CSS vars, cva+clsx, light/dark mode) |

---

## Epic E02. Domain models, validation, and state contracts

### Epic goal

Define the environment contracts cleanly so state, actions, and observations are deterministic and easy to train against.

### Current status

- `MOD 01` status: completed on 2026-03-08
- `MOD 01` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
- `MOD 02` status: completed on 2026-03-08
- `MOD 02` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
- `MOD 03` status: completed on 2026-03-08
- `MOD 03` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
- `MOD 04` status: completed on 2026-03-08
- `MOD 04` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
- `MOD 05` status: completed on 2026-03-08
- `MOD 05` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
- `MOD 11` status: completed on 2026-03-08
- `MOD 11` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
- `MOD 12` status: completed on 2026-03-08
- `MOD 12` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
- `MOD 09` status: completed on 2026-03-08
- Completed scope for `MOD 01`: replaced the placeholder `ScientistAction` with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, rejected mixed-mode payloads, added conditional validation for proposal, revision, request-info, and accept modes, added focused schema tests, and patched the stub server so `accept` no longer overwrites the current protocol with default values
- Completed scope for `MOD 02`: replaced the placeholder `LabManagerAction` with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, enforced feasible-flag consistency across budget, equipment, reagent, schedule, and staff checks, rejected suggestion fields outside `suggest_alternative`, and added focused validation tests
- Completed scope for `MOD 03`: introduced typed `ConversationEntry` and `Protocol` models, upgraded both observation branches to use typed nested structures with non-negative numeric constraints and stable keys, and verified dict-to-model coercion through the current stub server and focused tests
- Completed scope for `MOD 04`: replaced the remaining loose `dict` state and replay fields with typed `Protocol`, `ConversationEntry`, and `RewardBreakdown` models, updated the stub runtime to construct those nested models explicitly, and added round-trip coverage for serialized state and logs
- Completed scope for `MOD 05`: added deterministic semantic protocol validation in `replicalab/utils/validation.py` with `ValidationResult` and `validate_protocol(...)` checks for resource vocabulary, allowed substitutions, duration limits, required-element coverage, and obvious impossibilities against the normalized scenario pack
- Completed scope for `MOD 11`: introduced typed `RewardBreakdown` and `StepInfo` models, upgraded `StepResult.info` to the reserved-key contract while still allowing debug metadata, and updated the stub runtime to build typed reward and step-info payloads explicitly
- Completed scope for `MOD 12`: added `replicalab/config.py` as the shared constants module for default scenario, difficulty, round cap, budget cap, timeout values, stub reward, and API host or port defaults; updated the server and scenario builders to import those constants instead of repeating magic numbers
- Completed scope for `MOD 09`: added `replicalab/agents/scientist_policy.py` with a raw-text parser that extracts JSON from plain text or fenced blocks, validates it into `ScientistAction`, and raises an explicit `ScientistOutputParseError` for missing JSON, invalid JSON, or schema failures; added focused parser tests and package exports
- Newly unblocked by `MOD 01`: `MOD 05`, `MOD 09`
- Newly unblocked by `MOD 03`: `MOD 04`, `MOD 11`
- Newly unblocked by `MOD 04`: `MOD 07`, `ENV 01`
- Newly unblocked by `MOD 05`: `MOD 06`, `AGT 05`
- `MOD 11` does not introduce a new formal dependency edge by itself, but it stabilizes `StepResult` metadata for environment, API, replay, and training consumers
- `MOD 09` does not fully unblock a new task by itself, but it removes one half of the blocker on `AGT 03`; `AGT 03` now only waits on `AGT 02`
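
The `MOD 09` scope above describes a parser that pulls JSON out of raw model text, whether bare or inside a fenced block, and raises `ScientistOutputParseError` on failure. A minimal sketch of that idea (the brace-span heuristic here is an illustrative simplification; the real parser in `replicalab/agents/scientist_policy.py` also validates the payload into `ScientistAction`):

```python
import json

class ScientistOutputParseError(ValueError):
    """Raised when model output contains no usable JSON action."""

def extract_action_json(text: str) -> dict:
    # Find the first opening brace and the last closing brace; this covers
    # both bare JSON and JSON wrapped in a fenced code block.
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ScientistOutputParseError("no JSON object found in model output")
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ScientistOutputParseError(f"invalid JSON: {exc}") from exc
```

An explicit error type matters here because the retry strategy in `AGT 03` keys off parse failure versus schema failure.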

### User stories

**US E02.1**
As the environment, I need typed actions and observations so invalid messages can be rejected early.

**US E02.2**
As the training loop, I need deterministic state serialization so episodes can be replayed and compared.

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MOD 01 | E02.1 | Person A | `replicalab/models.py` | Implement `ScientistAction` schema | FND 08 | 0.5h | valid scientist actions parse and invalid fields raise validation errors | ✅ Completed | Person B (Ayush) |
| MOD 02 | E02.1 | Person A | `replicalab/models.py` | Implement `LabManagerAction` schema | FND 08 | 0.5h | valid lab manager actions parse and invalid fields raise validation errors | ✅ Completed | Person B (Ayush) |
| MOD 03 | E02.1 | Person A | `replicalab/models.py` | Implement role specific `Observation` models | FND 08 | 0.75h | scientist and lab observations serialize to JSON with stable keys | ✅ Completed | Person B (Ayush) |
| MOD 04 | E02.2 | Person A | `replicalab/models.py` | Implement `EpisodeState` and `EpisodeLog` models | MOD 03 | 0.75h | full state round trip serialize plus deserialize works | ✅ Completed | Person B (Ayush) |
| MOD 05 | E02.1 | Person A | `replicalab/utils/validation.py` | Add protocol validation for sample size, controls, duration, equipment vocab, reagent vocab | MOD 01 | 1h | invalid protocol examples are rejected with readable reasons | ✅ Completed | Person B (Ayush) |
| MOD 06 | E02.1 | Person A | `replicalab/utils/validation.py` | Add semantic validators for impossible plans such as zero sample size with positive controls | MOD 05 | 0.75h | semantic validator catches at least five invalid edge cases | ✅ Completed | Person B (Ayush) |
| MOD 07 | E02.2 | Person C | `replicalab/utils/logging.py` | Add state serialization helper for replay logs | MOD 04 | 0.5h | state logs can be written and loaded without loss | ✅ Completed | Person B (Ayush) |
| MOD 08 | E02.2 | Person A | tests | Write unit tests for schemas and validators | MOD 01 to MOD 07 | 1h | tests cover valid parse, invalid parse, and replay serialization | ✅ Completed | Person B (Ayush) |
| MOD 09 | E02.2 | Person B | `replicalab/agents/scientist_policy.py` | Add output parser that maps model text to `ScientistAction` | MOD 01 | 0.75h | parser returns structured action or explicit parse error | ✅ Completed | — |
| MOD 10 | E02.2 | Person C | API docs | Publish schema examples for frontend and notebook clients | MOD 01 to MOD 04 | 0.5h | frontend and notebook can mock against shared sample payloads | ✅ Completed | Person B (Ayush) |
| MOD 11 | E02.1 | Person A | `replicalab/models.py` | Implement `StepResult` model with observation, reward, done flag, and info dict | MOD 03 | 0.5h | step result serializes cleanly and all consumers agree on its shape | ✅ Completed | Person B (Ayush) |
| MOD 12 | E02.2 | Person A | `replicalab/config.py` | Create environment configuration module with constants for max rounds, default difficulty, timeout duration, max budget, and round time limit | FND 08 | 0.5h | all modules import config from one place and no magic numbers remain in env or scoring code | ✅ Completed | Person B (Ayush) |
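
The `MOD 01` row and its completed-scope note describe an enum-backed action schema that forbids unknown keys and rejects mixed-mode payloads. A plain-Python sketch of that validation shape (the mode names and field names are assumptions drawn from the frozen-contract description; the real schema in `replicalab/models.py` uses Pydantic with `extra="forbid"`):

```python
from enum import Enum

class ScientistMode(str, Enum):
    PROPOSAL = "proposal"
    REVISION = "revision"
    REQUEST_INFO = "request_info"
    ACCEPT = "accept"

ALLOWED_KEYS = {"mode", "protocol", "message"}

def parse_scientist_action(payload: dict) -> dict:
    # Reject unknown keys up front, mirroring extra="forbid".
    unknown = set(payload) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    mode = ScientistMode(payload["mode"])  # raises ValueError on invalid mode
    # Conditional validation per mode, as in the MOD 01 scope note.
    if mode in (ScientistMode.PROPOSAL, ScientistMode.REVISION) and not payload.get("protocol"):
        raise ValueError(f"{mode.value} requires a protocol")
    if mode is ScientistMode.ACCEPT and payload.get("protocol"):
        raise ValueError("accept must not carry a new protocol")
    return {"mode": mode, "protocol": payload.get("protocol"), "message": payload.get("message")}
```

The key design point is that mode-conditional rules live in one validator, so the environment can reject malformed actions before they reach the Lab Manager.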

---

## Epic E03. Scenario engine and constraint generation

### Epic goal

Generate deterministic, varied, and internally consistent technical scenarios through a normalized scenario layer.

### User stories

**US E03.1**
As a user, I want seeded scenarios so I can replay identical tasks.

**US E03.2**
As a judge, I want normalized constraints and resources so the environment tests real tradeoffs across domains without changing the outer contract.

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SCN 01 | E03.1 | Person A | `replicalab/utils/seed.py` | Implement deterministic RNG helper `seed_rng()` in dedicated seed utility module | FND 08 | 0.5h | same seed always yields the same random choices and seed module is importable from scenarios and env | ✅ Completed | Person B (Ayush) |
| SCN 02 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Define normalized scenario schema with task summary, success criteria, constraints, resources, allowed substitutions, and hidden reference spec | MOD 04 | 0.75h | all scenario builders return the same normalized top level structure and mapper-ready inputs | ✅ Completed | Person B (Ayush) |
| SCN 03 | E03.2 | Person A | `replicalab/scenarios/math_reasoning.py` | Implement mathematics template with theorem, proof-goal, tool, time, and review constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | ✅ Completed | Person B (Ayush) |
| SCN 04 | E03.2 | Person A | `replicalab/scenarios/ml_benchmark.py` | Implement ML benchmark template with dataset, compute, time, and evaluation constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | ✅ Completed | Person B (Ayush) |
| SCN 05 | E03.2 | Person A | `replicalab/scenarios/finance_trading.py` | Implement finance and trading planning template with risk, capital, slippage, and backtest constraints | SCN 02 | 1h | generated scenario passes structure and internal consistency tests | ✅ Completed | Person B (Ayush) |
| SCN 06 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement difficulty application for easy, medium, hard by mechanically altering constraints, resources, and conflicts | SCN 03 to SCN 05 | 1h | difficulty visibly changes the normalized scenario pack in a meaningful way | ✅ Completed | Person B (Ayush) |
| SCN 07 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement normalized constraint and resource generator for budget, time, compute, personnel, stock, and bookings | SCN 02 | 1.25h | no generated scenario contains contradictory constraints or resources | ✅ Completed | Person B (Ayush) |
| SCN 08 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement hidden reference spec and allowed substitutions per template | SCN 03 to SCN 05 | 1h | hidden reference clearly marks what is fixed versus flexible for deterministic scoring | ✅ Completed | Person B (Ayush) |
| SCN 09 | E03.1 | Person A | `replicalab/scenarios/templates.py` | Implement `generate_scenario(seed, template, difficulty)` | SCN 01 to SCN 08 | 0.75h | function returns a full scenario with deterministic content | ✅ Completed | Person B (Ayush) |
| SCN 10 | E03.1 | Person A | tests | Add seeded generation tests and consistency tests | SCN 09 | 1h | same seed plus template returns same scenario and different seeds vary | ✅ Completed | Person B (Ayush) |
| SCN 11 | E03.2 | Person B | fixtures | Create hand checked golden scenarios for prompt testing | SCN 09 | 0.75h | three fixed scenarios are available for deterministic manual testing | ✅ Completed | — |
| SCN 12 | E03.2 | Person D | docs | Write plain language scenario summaries for UI examples and README | SCN 03 to SCN 05 | 0.5h | each template has a clean one paragraph explanation for judges | ✅ Completed | Person B (Ayush) - README scenario summaries aligned with actual math/ML/finance templates |
| SCN 13 | E03.2 | Person A | `replicalab/scenarios/templates.py` | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time slot conflicts and duration | SCN 07 | 1h | constraint generator can produce realistic booking conflicts across domains and the Lab Manager can check availability | ✅ Completed | Person B (Ayush) |
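
`SCN 01` and `SCN 09` together pin down the determinism contract: a seeded RNG helper plus `generate_scenario(seed, template, difficulty)` returning identical content for identical inputs. A toy sketch of that contract (the scenario fields and difficulty multipliers here are invented for illustration; the real builders live under `replicalab/scenarios/`):

```python
import random

def seed_rng(seed: int) -> random.Random:
    # A dedicated Random instance avoids polluting global RNG state,
    # so scenarios and the env can share seeds without interference.
    return random.Random(seed)

def generate_scenario(seed: int, template: str, difficulty: str) -> dict:
    rng = seed_rng(seed)
    # Placeholder difficulty scaling: harder runs get tighter budgets.
    budget = rng.randint(5, 20) * {"easy": 1000, "medium": 750, "hard": 500}[difficulty]
    return {
        "template": template,
        "difficulty": difficulty,
        "budget": budget,
        "constraint_count": rng.randint(2, 6),
    }
```

The seeded-generation tests in `SCN 10` reduce to exactly this property: same seed plus template yields an identical scenario dict.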

---

## Epic E04. Scientist agent and Lab Manager policy

### Epic goal

Create the interactive roles that operate inside the environment while keeping truth in deterministic checkers and reward logic.

### User stories

**US E04.1**
As the Scientist agent, I want a structured action space so I can learn consistent policy behavior.

**US E04.2**
As the Lab Manager, I want grounded negotiation plus deterministic feasibility checks so the environment remains stable and fair.

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AGT 01 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Draft domain-neutral system prompt for Scientist role from normalized scenario data | MOD 01, SCN 11 | 0.75h | prompt clearly explains role, mapped constraints, and JSON output contract | ✅ Completed | — |
| AGT 02 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build observation to prompt formatting helper from normalized scenario-derived observations | AGT 01, MOD 03 | 0.75h | formatted prompt includes task info, history, and action schema consistently | ✅ Completed | — |
| AGT 03 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Add parse plus retry strategy for malformed model output | MOD 09, AGT 02 | 0.75h | malformed output triggers at least one controlled retry or explicit failure | ✅ Completed | — |
| AGT 04 | E04.1 | Person B | `replicalab/agents/scientist_policy.py` | Build baseline heuristic Scientist for non trained smoke tests | AGT 02 | 1h | baseline can complete episodes without crashing | ✅ Completed | — |
| AGT 05 | E04.2 | Person A and B | `replicalab/agents/lab_manager_policy.py` | Implement deterministic feasibility checker against normalized constraints, resources, schedule, and policy rules | SCN 07, MOD 05 | 1.25h | checker returns clear pass or fail per constraint dimension | ✅ Completed | Person B (Ayush) |
| AGT 06 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | AGT 05, SCN 08 | 1h | lab manager can suggest at least one sensible revision when initial plan fails | ✅ Completed | — |
| AGT 07 | E04.2 | Person B | `replicalab/agents/lab_manager_policy.py` | Add model-backed response synthesis from feasibility results and suggested revisions | AGT 05 | 0.75h | output is readable, grounded in checker results, and maps cleanly to underlying checks | ✅ Completed | — |
| AGT 08 | E04.1 | Person B | tests | Add prompt formatting, parse, and bounded-tool policy tests for Scientist policy | AGT 01 to AGT 04 | 0.75h | tests cover happy path, malformed output handling, and stable tool-policy reminders | ✅ Completed | — |
| AGT 09 | E04.2 | Person A | tests | Add deterministic feasibility checker tests for Lab Manager grounding | AGT 05 to AGT 07 | 0.75h | same proposal plus same normalized scenario returns the same checker results every time | ✅ Completed | Person B (Ayush) |
| AGT 10 | E04.1 | Person B | `replicalab/prompts/` | Write prompt text files for all three roles: `scientist.txt`, `lab_manager.txt`, `judge.txt`, including bounded rules for search, code checks, and image inspection | AGT 01, AGT 07, JDG 06 | 0.75h | prompt files exist, are loadable, encode bounded tool rules clearly, and assemble correctly from normalized scenario data and agreed role behavior | ✅ Completed | — |
| AGT 11 | E04.1 | Person B | docs | Select and document base model for Scientist training with rationale for model size, license, and structured output capability | AGT 01 | 0.5h | decision is recorded and all team members know which model will be fine tuned | ✅ Completed | — |

---

## Epic E05. Judge engine and reward logic

### Epic goal

Score the final plan fairly, explainably, and deterministically against the hidden reference spec.

### User stories

**US E05.1**
As the training system, I need a stable reward so the model can improve.

**US E05.2**
As a judge, I need a readable score breakdown so I can understand why the environment rewarded or penalized the agent.

### Executor notes

- `JDG 01` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
- `JDG 02` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
- `JDG 03` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JDG 01 | E05.1 | Person A | `replicalab/scoring/rigor.py` | Implement rigor or objective-validity score for plan completeness, required checks, method quality, justification, and correct bounded evidence use when present | SCN 08 | 1.25h | score is between 0 and 1, matches rubric examples, and rewards correct evidence-backed planning without depending on live web results | ✅ Completed | Person B (Ayush) |
| JDG 02 | E05.1 | Person A | `replicalab/scoring/feasibility.py` | Implement feasibility score for budget, resources, time, staffing, compute, bookings, and deterministic tool-backed validation results | SCN 07, AGT 05 | 1.25h | score is between 0 and 1 and matches normalized constraint logic plus deterministic tool outcomes | ✅ Completed | Person B (Ayush) |
| JDG 03 | E05.1 | Person A | `replicalab/scoring/fidelity.py` | Implement fidelity score against hidden reference spec, required steps, allowed substitutions, and supported evidence claims when present | SCN 08 | 1h | score is between 0 and 1 and matches rubric examples for plan and evidence alignment | ✅ Completed | Person B (Ayush) |
| JDG 04 | E05.1 | Person A | `replicalab/scoring/rubric.py` | Implement total reward formula with bonuses and penalties, including deterministic penalties for invalid tool use or unsupported evidence claims | JDG 01 to JDG 03 | 0.75h | total reward formula matches agreed math and returns consistent output for plan quality and bounded tool behavior | ✅ Completed | Person B (Ayush) |
| JDG 05 | E05.2 | Person A | `replicalab/scoring/rubric.py` | Build reward breakdown object with component scores, penalties, and tool-use diagnostics | JDG 04 | 0.5h | breakdown includes rigor, feasibility, fidelity, bonuses, penalties, and bounded tool diagnostics | ✅ Completed | Person B (Ayush) |
| JDG 06 | E05.2 | Person A | `replicalab/scoring/explain.py` | Add optional plain English explanation function from reward breakdown | JDG 05 | 0.75h | explanation mirrors rubric, may reference bounded evidence or tool outcomes, and introduces no new hidden logic | ✅ Completed | Person B (Ayush) |
| JDG 07 | E05.1 | Person C | `replicalab/utils/logging.py` | Log reward breakdown to CSV or JSONL per episode | JDG 05, MOD 07 | 0.5h | reward file contains seed, scenario, score components, total reward, rounds, agreement, and bounded tool metrics | ✅ Completed | Person B (Ayush) |
|
| 558 |
+
| JDG 08 | E05.1 | Person A | tests | Add score determinism tests and edge case tests | JDG 01 to JDG 05 | 1h | perfect and broken protocols produce expected relative ordering | ✅ Completed | Person B (Ayush) |
|
| 559 |
+
| JDG 09 | E05.2 | Person D | UI mocks | Create mock score cards and language for frontend | JDG 05 | 0.5h | UI can display score breakdown from mock data | ✅ Completed | Kush - ScorePanel with rigor/feasibility/fidelity bars and ScoreBar component |
|
| 560 |
+
| JDG 10 | E05.1 | Person B | notebook support | Expose component metrics for training plots | JDG 05, JDG 07 | 0.5h | notebook can read average rigor, feasibility, fidelity, and bounded tool metrics over time | ✅ Completed | Person B (Ayush) |
|
| 561 |
+
| JDG 11 | E05.2 | Person A | `replicalab/scoring/rubric.py` and `replicalab/agents/judge_policy.py` | Add structured final audit payload with `judge_notes`, `verdict`, and top failure reasons derived from the rubric | JDG 05, JDG 06 | 0.75h | final judgement output is deterministic, human readable, and consumable by env, API, logs, and UI | ✅ Completed | Person B (Ayush) |
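
The rubric math promised in JDG 04 and the breakdown object in JDG 05 can be sketched as a single pure function. The 0.4/0.3/0.3 weights, the clamp to [0, 1], and the field names below are illustrative assumptions for this sketch, not the project's frozen rubric constants:

```python
def total_reward(rigor: float, feasibility: float, fidelity: float,
                 bonuses: float = 0.0, penalties: float = 0.0) -> dict:
    """Combine component scores (each in [0, 1]) into one reward.

    The 0.4/0.3/0.3 weights and the [0, 1] clamp are illustrative
    assumptions, not the project's agreed rubric constants.
    """
    base = 0.4 * rigor + 0.3 * feasibility + 0.3 * fidelity
    total = max(0.0, min(1.0, base + bonuses - penalties))
    # Breakdown mirrors JDG 05: every component the UI and logs consume.
    return {
        "rigor": rigor,
        "feasibility": feasibility,
        "fidelity": fidelity,
        "bonuses": bonuses,
        "penalties": penalties,
        "total": total,
    }

perfect = total_reward(1.0, 1.0, 1.0)
broken = total_reward(0.2, 0.1, 0.3, penalties=0.2)
```

Because the combination is pure arithmetic over bounded inputs, the determinism tests in JDG 08 reduce to asserting that identical inputs always produce an identical breakdown and that a perfect protocol outscores a broken one.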

---

## Epic E06. OpenEnv environment implementation

### Epic goal
Turn the scenario, roles, and reward logic into a real OpenEnv environment.

### User stories

**US E06.1**
As a client, I want `reset()` to start a clean, seeded episode.

**US E06.2**
As a client, I want `step()` to advance one turn and return observation, reward, and done.

**US E06.3**
As a judge, I want deterministic replay and cleanup.

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ENV 01 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Create `ReplicaLabEnv` class skeleton | MOD 04, SCN 09 | 0.5h | environment class imports and instantiates without runtime errors | ✅ Completed | Person B (Ayush) |
| ENV 02 | E06.1 | Person A | `replicalab/env/replicalab_env.py` | Implement `reset(seed, template, difficulty)` | ENV 01, SCN 09 | 1h | reset returns initial observations and a fresh episode state | ✅ Completed | Person B (Ayush) |
| ENV 03 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Scientist turn application and bounded tool mediation | ENV 02, AGT 05 | 1h | valid Scientist action plus any allowed tool traces update state and history correctly | ✅ Completed | Person B (Ayush) |
| ENV 04 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement internal Lab Manager response step with bounded validation tools | ENV 03, AGT 07 | 1h | lab manager response plus any supporting bounded tool traces are appended and returned in the next observation | ✅ Completed | Person B (Ayush) |
| ENV 05 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Implement accept, timeout, and max round logic | ENV 03, ENV 04 | 0.75h | episode terminates correctly on agreement or round limit | ✅ Completed | Person B (Ayush) |
| ENV 06 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Integrate reward computation at finalization and optional intermediate score previews | ENV 05, JDG 05 | 1h | final step returns total reward, breakdown info, and deterministic penalties or bonuses for bounded tool behavior | ✅ Completed | Person B (Ayush) |
| ENV 07 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `state()` | ENV 02 to ENV 06 | 0.5h | current environment state can be retrieved for debugging and replay | ✅ Completed | Person B (Ayush) |
| ENV 08 | E06.3 | Person A | `replicalab/env/replicalab_env.py` | Implement `close()` cleanup | ENV 01 | 0.25h | close frees any transient resources and does not throw | ✅ Completed | Person B (Ayush) |
| ENV 09 | E06.3 | Person C | `replicalab/utils/logging.py` | Write episode logs on completion | ENV 06, JDG 07 | 0.5h | completed episodes generate replayable logs automatically | ✅ Completed | Person B (Ayush) |
| ENV 10 | E06.1 to E06.3 | Person A | tests | Add reset, step, invalid action, timeout, and deterministic replay tests | ENV 02 to ENV 09 | 1.25h | tests pass for seeded reset, valid step, invalid step, and replay consistency | ✅ Completed | Person B (Ayush) |
| ENV 11 | E06.2 | Person A | `replicalab/env/replicalab_env.py` | Attach judge audit payload to final `StepResult`, terminal observations, and replay state | ENV 06, JDG 11 | 0.5h | completed episodes expose audit notes alongside reward breakdown in a stable schema | ✅ Completed | Person B (Ayush) |
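
The `reset()`, `step()`, `state()`, `close()` contract in ENV 01 to ENV 08 boils down to the loop below. `ReplicaLabEnv` here is a toy stand-in (seeded RNG, fixed round limit, reward only at termination) used purely to show the control flow; the real environment wires in the agents, bounded tools, and rubric instead of random numbers:

```python
import random

class ReplicaLabEnv:
    """Toy stand-in for the real environment; shows the reset/step contract."""

    def __init__(self, max_rounds: int = 5):
        self.max_rounds = max_rounds
        self._rng = None
        self.round = 0
        self.done = True

    def reset(self, seed: int = 0) -> dict:
        # Seeded reset gives deterministic replay (US E06.3).
        self._rng = random.Random(seed)
        self.round = 0
        self.done = False
        return {"round": 0, "prompt": "propose a protocol"}

    def step(self, action: dict) -> tuple:
        if self.done:
            raise RuntimeError("call reset() before step()")
        self.round += 1
        accepted = action.get("type") == "accept"
        # Terminate on agreement or round limit (ENV 05).
        self.done = accepted or self.round >= self.max_rounds
        # Reward only at finalization, mirroring ENV 06.
        reward = self._rng.random() if self.done else 0.0
        obs = {"round": self.round, "done": self.done}
        return obs, reward, self.done

    def state(self) -> dict:
        return {"round": self.round, "done": self.done}

    def close(self) -> None:
        self._rng = None

env = ReplicaLabEnv()
obs = env.reset(seed=42)
while True:
    obs, reward, done = env.step({"type": "propose"})
    if done:
        break
env.close()
```

Replaying the same seed and action sequence through this shell yields the same state sequence, which is exactly what the ENV 10 replay test checks on the real implementation.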

---

## Epic E07. API, server, Docker, and deployment

### Epic goal
Serve the environment reliably for frontend users and training clients, then deploy it to Hugging Face Spaces.

### User stories

**US E07.1**
As a client, I want to connect over WebSocket or REST to interact with the environment remotely.

**US E07.2**
As the team, we want one click reproducible deployment to HF Spaces.

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| API 01 | E07.1 | Person C | `server/app.py` | Create FastAPI app shell and health endpoint | ENV 01 | 0.5h | `GET /health` returns 200 with simple payload | ✅ Completed | Person B (Ayush) |
| API 02 | E07.1 | Person C | `server/app.py` | Add `POST /reset` endpoint | ENV 02 | 0.75h | reset endpoint starts a new episode and returns initial observation | ✅ Completed | Person B (Ayush) |
| API 03 | E07.1 | Person C | `server/app.py` | Add `POST /step` endpoint | ENV 06 | 0.75h | step endpoint accepts valid action and returns step result | ✅ Completed | Person B (Ayush) |
| API 04 | E07.1 | Person C | `server/app.py` | Add `GET /scenarios` endpoint | SCN 03 to SCN 05 | 0.5h | endpoint lists available scenario families and difficulties | ✅ Completed | Person B (Ayush) |
| API 05 | E07.1 | Person C | `server/app.py` | Add `GET /replay/{episode_id}` endpoint | ENV 09 | 0.75h | endpoint returns completed log for valid episode id | ✅ Completed | Person B (Ayush) |
| API 06 | E07.1 | Person C | `server/app.py` | Add WebSocket session handler | ENV 06 | 1.25h | each connection gets isolated environment state and can reset plus step | ✅ Completed | Person B (Ayush) |
| API 07 | E07.1 | Person C | `server/app.py` | Add idle timeout and graceful disconnect cleanup | API 06, ENV 08 | 0.75h | stale connections close cleanly and environment closes without leak | ✅ Completed | Person B (Ayush) |
| API 08 | E07.2 | Person C | `server/Dockerfile` | Build Dockerfile with Python app startup on port 7860 | API 01 to API 07 | 0.75h | local Docker run serves app on port 7860 | ✅ Completed | Person B (Ayush) |
| API 09 | E07.2 | Person C | HF config files | Add Hugging Face Space metadata and deploy instructions | API 08 | 0.5h | Space config is valid for Docker app deployment | ✅ Completed | Person B (Ayush) |
| API 10 | E07.2 | Person C | deployment docs | Deploy live Space and verify health, reset, and step | API 09 | 1h | live Space responds successfully to health and one end to end episode | ✅ Completed | Person B (Ayush) |
| API 11 | E07.1 | Person C | tests | Add server endpoint tests and WebSocket smoke test | API 01 to API 07 | 1h | local server tests pass for health, reset, step, invalid payload, and ws connect | ✅ Completed | Person B (Ayush) |
| API 12 | E07.2 | Person D | docs | Capture deployment screenshots and public link for README | API 10 | 0.25h | README ready screenshots and live link are available | ✅ Completed | Person B (Ayush) - live HF Space link in README, screenshot guide in docs/recording_guide.md |
| API 13 | E07.1 | Person C | `server/app.py` | Add CORS middleware configuration for frontend origins in dev and production | API 01 | 0.25h | frontend on localhost:5173 and HF Space origin can reach the API without CORS errors | ✅ Completed | Person B (Ayush) |
| API 14 | E07.1 | Person C | `server/app.py` | Add REST session management so each user gets isolated environment state | API 02, API 03 | 0.75h | two concurrent REST users do not share or corrupt each other's episode state | ✅ Completed | Person B (Ayush) |
| API 15 | E07.2 | Person C | HF Space repo | Create HF Space README.md with YAML frontmatter specifying `sdk: docker`, `app_port: 7860`, title, and emoji | API 08 | 0.25h | HF Space config is valid and Space launches correctly from the metadata | ✅ Completed | Person B (Ayush) |
| API 16 | E07.2 | Person C | `server/Dockerfile` | Configure Docker to build frontend and serve static assets from FastAPI in a single container | API 08, UI 10 | 0.75h | single Docker container serves both API and frontend on port 7860 | ✅ Completed | Person D (Kush) |
| API 17 | E07.2 | Person C | deployment docs | Document secrets and API key management for hosted Scientist model access in deployment and notebook | API 09 | 0.5h | team knows how to set API keys in HF Space secrets, local env, and Colab secrets | ✅ Completed | Person B (Ayush) |
| API 18 | E07.1 | Person C | `server/app.py` | Include judge audit payload plus bounded tool-trace summaries in REST, replay, and WebSocket responses for terminal episodes | API 03, API 05, API 06, ENV 11 | 0.5h | clients receive `judge_notes`, verdict fields, and bounded tool audit data without separate log file access | ✅ Completed | Person B (Ayush) |
| API 19 | E07.2 | Person C | `openenv.yaml` and deployment docs | Expose and verify OpenEnv built in `/web` fallback route locally and on HF Space | FND 09, API 08, API 10 | 0.5h | `/web` is documented, reachable, and able to run a seeded episode when the custom UI is unavailable | ✅ Completed | Person B (Ayush) |

---

## Epic E08. RL training pipeline and evaluation

### Epic goal
Train the Scientist agent and show observable reward improvement.

### User stories

**US E08.1**
As a judge, I want a Colab notebook that clearly trains the agent and shows improvement.

**US E08.2**
As the team, we want a repeatable evaluation workflow for before versus after comparison.

V2 note: the Scientist remains the primary RL target, but the training stack
now also supports a separate Lab Manager SFT artifact on the same base model
family. This is additive to the deterministic reward loop, not a replacement
for it.

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TRN 01 | E08.1 | Person B | `notebooks/train_colab.ipynb` | Create notebook skeleton with setup, connect, train, bounded-tool policy, and plot sections | API 10 | 0.5h | notebook has clear runnable sections in the right order and documents the bounded-tool policy | ✅ Completed | — |
| TRN 02 | E08.1 | Person B | notebook | Add package install and model setup cell for Unsloth or HF TRL | TRN 01 | 0.75h | notebook installs dependencies without manual edits beyond secrets | ✅ Completed | — |
| TRN 03 | E08.1 | Person B | notebook or `client.py` | Implement environment client wrapper for reset plus step over WebSocket or REST | API 06 | 1h | notebook can start and finish an episode against local or hosted env and can read tool-aware step payloads | ✅ Completed | — |
| TRN 04 | E08.1 | Person B | notebook | Implement rollout collection loop for Scientist episodes | TRN 03, AGT 01 | 1h | loop collects trajectories, rewards, done signals, and bounded tool traces from frozen evidence packs | ✅ Completed | — |
| TRN 05 | E08.1 | Person B | notebook | Connect rollouts to GRPO or equivalent trainer | TRN 04 | 1.25h | at least one short training run completes without runtime errors while preserving deterministic reward and frozen evidence inputs | ✅ Completed | Person B (Ayush) |
| TRN 06 | E08.1 | Person B | notebook | Log episode reward, rigor, feasibility, fidelity, rounds used, and bounded tool metrics | JDG 10, TRN 04 | 0.75h | notebook stores a metrics frame across training episodes including bounded tool metrics | ✅ Completed | Person B (Ayush) |
| TRN 07 | E08.2 | Person B | notebook | Plot reward curve and component curves with matplotlib | TRN 06 | 0.5h | plotted image shows visible metrics and can be saved to file | ✅ Completed | Person B (Ayush) |
| TRN 08 | E08.2 | Person B | notebook | Add before versus after evaluation on fixed seeds and frozen evidence packs | SCN 11, TRN 05 | 1h | notebook compares baseline and trained policy on the same scenarios and evidence packs | ✅ Completed | Person B (Ayush) |
| TRN 09 | E08.2 | Person B | `replicalab/agents/scientist_policy.py` | Add policy loading path for trained adapter or checkpoint | TRN 05 | 0.5h | evaluation can switch between baseline and trained model cleanly | ✅ Completed | Person B (Ayush) |
| TRN 10 | E08.2 | Person B | docs | Export plot image and sample logs to `outputs/plots` | TRN 07 | 0.25h | plots are saved and versioned for README use | ✅ Completed | Person B (Ayush) |
| TRN 11 | E08.1 | Person C | infra notes | Document environment URL, secrets, and connection troubleshooting | TRN 03 | 0.25h | any team member can run the notebook using the notes | ✅ Completed | Person B (Ayush) |
| TRN 12 | E08.2 | Person D | storytelling | Convert evaluation results into two or three clear bullet insights for judges | TRN 08 | 0.5h | README and demo can state what improved in plain English | ✅ Completed | Person B (Ayush) - "What Improved" + "Key Takeaways" sections in README |
| TRN 13 | E08.1 | Person B | `replicalab/client.py` | Create reusable environment client module with `connect()`, `reset()`, `step()`, `close()` over REST and WebSocket | API 06 | 1h | client module can be imported by notebook and other consumers without duplicating connection logic | ✅ Done | 2026-03-08 |
| TRN 14 | E08.1 | Person B | notebook or docs | Select and document base model for Scientist fine tuning with rationale for size, license, and structured output capability | TRN 01 | 0.5h | base model choice is documented and all team members know which model is being trained | ✅ Completed | — |
| TRN 15 | E08.2 | Person B | notebook | Add agreement rate, invalid action rate, and invalid bounded-tool rate aggregation to evaluation outputs and before versus after comparison | TRN 06, TRN 08, OBS 09 | 0.5h | notebook reports reward, rounds, agreement rate, invalid action rate, and invalid bounded-tool rate for baseline and trained runs | ✅ Completed | Person B (Ayush) |
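
The reusable client module in TRN 13 can be reduced to a thin wrapper that owns the transport. In the sketch below the transport is injected as a callable so rollout code (TRN 04) and tests never depend on a live server; the `/reset` and `/step` paths match API 02 and API 03, while the class name, payload shapes, and `template` parameter are assumptions for illustration:

```python
from typing import Callable

class EnvClient:
    """Minimal REST-style client: connect, reset, step, close.

    `transport` is any callable (path, payload) -> dict, for example a
    thin wrapper around an HTTP POST to the deployed Space.
    """

    def __init__(self, transport: Callable[[str, dict], dict]):
        self._transport = transport
        self.connected = False

    def connect(self) -> None:
        self.connected = True

    def reset(self, seed: int, template: str = "default") -> dict:
        return self._transport("/reset", {"seed": seed, "template": template})

    def step(self, action: dict) -> dict:
        return self._transport("/step", {"action": action})

    def close(self) -> None:
        self.connected = False

# A fake transport stands in for the live server during tests.
def fake_transport(path: str, payload: dict) -> dict:
    if path == "/reset":
        return {"observation": {"round": 0}, "done": False}
    return {"observation": {"round": 1}, "reward": 0.5, "done": True}

client = EnvClient(fake_transport)
client.connect()
first = client.reset(seed=42)
result = client.step({"type": "accept"})
client.close()
```

Swapping `fake_transport` for a real HTTP or WebSocket transport is the only change needed to point the notebook at the local server or the hosted Space.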

---

## Epic E09. Frontend, UX, replay, and demo views

### Epic goal
Create a judge friendly interface that makes the environment behavior obvious in seconds.

### User stories

**US E09.1**
As a judge, I want to immediately see the paper, the negotiation, and the score.

**US E09.2**
As the team, we want a replayable UI for debugging and recording the demo.

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UI 01 | E09.1 | Person D | `frontend/src/App.tsx` | Create application shell with three panel layout | FND 03 | 0.75h | app renders layout for paper, conversation, and scoring panels | ✅ Completed | Kush - EpisodePage 3-column grid layout |
| UI 02 | E09.1 | Person D | `frontend/src/components/PaperPanel.tsx` | Build original paper summary panel | SCN 12 | 0.75h | panel displays title, hypothesis, method, key finding, and seed | ✅ Completed | Kush |
| UI 03 | E09.1 | Person D | `frontend/src/components/ProtocolPanel.tsx` | Build current protocol and diff panel | JDG 09 | 1h | panel highlights current plan fields and updates after each round | ✅ Completed | Kush - DiffRow comparisons, equipment, reagents |
| UI 04 | E09.1 | Person D | `frontend/src/components/NegotiationLog.tsx` | Build chat style negotiation log | API 03 or API 06 | 1h | scientist and lab manager messages show in correct order with role styling | ✅ Completed | Kush - message log with auto-scroll, character avatars, role styling |
| UI 05 | E09.1 | Person D | `frontend/src/components/ScorePanel.tsx` | Build rigor, feasibility, fidelity, and total score cards | JDG 09 | 0.75h | score cards render component values and penalties clearly | ✅ Completed | Kush - ScoreBar component with rigor/feasibility/fidelity visualization |
| UI 06 | E09.2 | Person D | `frontend/src/components/Controls.tsx` | Build new episode, seed input, scenario selector, and start controls | API 02, API 04 | 0.75h | user can start a chosen scenario with chosen seed from UI | ✅ Completed | Kush - scenario selector, difficulty toggle, seed input with random button |
| UI 07 | E09.2 | Person D | `frontend/src/lib/api.ts` | Add REST plus WebSocket client helpers | API 02 to API 06 | 0.75h | UI can connect locally and to the hosted Space | ✅ Completed | Person D (Kush) |
| UI 08 | E09.2 | Person D | `frontend/src/components/ReplayViewer.tsx` | Build replay viewer from completed episode logs | API 05 | 1h | user can load a past episode and step through rounds | ✅ Completed | Kush - range slider, skip controls, character avatars |
| UI 09 | E09.1 | Person D | `frontend/src/components/TrainingResults.tsx` | Add before versus after panel or static result card | TRN 10 | 0.75h | UI can show reward curve image and summary metrics | ✅ Completed | Kush - LineChart with mock data, 4 metric cards |
| UI 10 | E09.1 | Person D | frontend styling | Add clean visual styling with Tailwind plus shadcn compatible primitives and responsive spacing | UI 01 to UI 09, FND 13 | 0.75h | UI is presentable on demo screen without layout breaks and styling stack matches the declared toolchain | ✅ Completed | Person D (Kush) |
| UI 11 | E09.2 | Person C | integration | Serve frontend with backend or configure proxy during dev | UI 07, API 01 | 0.5h | one command local dev works and deployed app serves UI path | ✅ Completed | Person D (Kush) |
| UI 12 | E09.2 | Person D | tests and smoke | Add smoke test checklist for core UI flow | UI 01 to UI 11 | 0.5h | checklist confirms new episode, step, score update, and replay all work | ✅ Completed | Person B (Ayush) - docs/ui_smoke_checklist.md |
| UI 13 | E09.1 | Person D | `frontend/src/components/JudgeAuditPanel.tsx` or `NegotiationLog.tsx` | Render final Judge audit text and verdict at episode end | JDG 11, API 18 | 0.75h | UI shows a clear end of episode audit without hiding the deterministic score breakdown | ✅ Completed | Kush - JudgeAuditPanel with verdict icon, judge notes, failure reasons |
| UI 14 | E09.2 | Person D | `frontend/src/components/ReplayViewer.tsx` | Add replay slider or scrubber so judges can move across rounds quickly | UI 08 | 0.5h | user can scrub to any round without replaying the full episode sequentially | ✅ Completed | Kush - HTML5 range input with skip buttons |
| UI 15 | E09.1 | Person D | `frontend/src/components/TrainingResults.tsx` and `Controls.tsx` | Add before versus after training toggle for baseline versus trained views in the demo UI | UI 06, UI 09, TRN 15 | 0.5h | judges can switch between baseline and trained result summaries from the UI | ✅ Completed | Kush - ToggleLeft/ToggleRight baseline vs trained view |

---

## Epic E10. Logging, replay, and observability

### Epic goal
Make behavior inspectable for debugging, judging, and storytelling.

### User stories

**US E10.1**
As a developer, I want clear logs so I can diagnose why an episode failed.

**US E10.2**
As a judge, I want the same seeded scenario to be replayable.

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OBS 01 | E10.1 | Person C | `replicalab/utils/logging.py` | Standardize episode log schema for transcript, state snapshots, and scores | ENV 09 | 0.5h | every completed episode log contains the same required fields | ✅ Completed | Person B (Ayush) |
| OBS 02 | E10.1 | Person C | logging config | Add local log levels and readable console formatting | API 01 | 0.5h | debug logs can be toggled without code edits | ✅ Completed | Person B (Ayush) |
| OBS 03 | E10.1 | Person C | replay utilities | Add episode id generation and file naming conventions | OBS 01 | 0.25h | logs never overwrite and are easy to locate | ✅ Completed | Person B (Ayush) |
| OBS 04 | E10.2 | Person A | tests | Add deterministic replay test using seed and action sequence | ENV 10 | 0.75h | replay of same seed and actions matches prior state sequence | ✅ Completed | Person B (Ayush) |
| OBS 05 | E10.2 | Person D | UI | Surface episode id and replay link in UI | API 05, UI 08 | 0.5h | user can easily capture or revisit a past episode | ✅ Completed | Kush - PaperPanel episode ID display with copy-to-clipboard |
| OBS 06 | E10.1 | Person B | notebook | Log training run metadata including model, seed, scenario set, steps, evidence-pack version, and bounded-tool policy | TRN 06 | 0.5h | notebook exports metadata with each run for reproducibility including evidence-pack version and bounded-tool policy | ✅ Completed | Person B (Ayush) |
| OBS 07 | E10.1 | Person C | scripts | Add simple local script to run one episode and dump logs | ENV 06, OBS 01 | 0.5h | one command produces a complete local sample log | ✅ Completed | Person B (Ayush) |
| OBS 08 | E10.2 | Person D | storytelling | Create static replay screenshots or gifs for README and video | UI 08 | 0.5h | at least two crisp visual assets are ready for docs and demo | ✅ Completed | Person B (Ayush) - screenshot guide in docs/recording_guide.md with required list |
| OBS 09 | E10.1 | Person C | `replicalab/utils/logging.py` | Extend episode summary schema with `judge_notes`, `agreement`, `invalid_action_count`, and `invalid_action_rate` for replay and evaluation consumers | OBS 01, JDG 11, ENV 11 | 0.5h | every completed episode log contains the audit payload plus demo and evaluation metrics needed by notebook, UI, and README | ✅ Completed | Person B (Ayush) |
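
The standardized schema in OBS 01 and the no-overwrite naming convention in OBS 03 can be sketched as a validated JSONL append keyed by a collision-free episode id. The required-field set below follows the fields named in JDG 07 and OBS 09; everything else (function names, id prefix, directory layout) is an assumption for this sketch:

```python
import json
import tempfile
import uuid
from pathlib import Path

# Required fields follow JDG 07 / OBS 09; extras are allowed.
REQUIRED_FIELDS = {"episode_id", "seed", "scenario", "scores",
                   "total_reward", "rounds", "agreement"}

def new_episode_id() -> str:
    # Random hex suffix means logs never overwrite each other (OBS 03).
    return f"ep-{uuid.uuid4().hex[:12]}"

def write_episode_log(log_dir: Path, summary: dict) -> Path:
    missing = REQUIRED_FIELDS - summary.keys()
    if missing:
        raise ValueError(f"episode summary missing fields: {sorted(missing)}")
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{summary['episode_id']}.jsonl"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(summary) + "\n")
    return path

summary = {
    "episode_id": new_episode_id(),
    "seed": 42,
    "scenario": "cell_viability_easy",
    "scores": {"rigor": 0.8, "feasibility": 0.9, "fidelity": 0.7},
    "total_reward": 0.81,
    "rounds": 4,
    "agreement": True,
}
log_root = Path(tempfile.mkdtemp())
written = write_episode_log(log_root, summary)
```

Rejecting incomplete summaries at write time is what makes the OBS 01 guarantee ("every completed episode log contains the same required fields") enforceable rather than aspirational.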

---

## Epic E11. Testing and quality gates

### Epic goal
Reduce demo day breakage and keep the environment stable.

### User stories

**US E11.1**
As the team, we want automated tests around core behavior so merges do not silently break the demo.

**US E11.2**
As a judge, I want the system to work reliably when clicked live.

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TST 01 | E11.1 | Person A | `tests/test_env.py` | Add reset returns valid observations test | ENV 02 | 0.5h | test confirms both roles receive valid structured observations | ✅ Completed | Person B (Ayush) |
| TST 02 | E11.1 | Person A | `tests/test_env.py` | Add valid action step test | ENV 03 to ENV 06 | 0.5h | valid action advances round and returns correct shape | ✅ Completed | Person B (Ayush) |
| TST 03 | E11.1 | Person A | `tests/test_env.py` | Add invalid action handling test | MOD 05, ENV 03 | 0.5h | invalid action yields structured error and environment survives | ✅ Completed | Person B (Ayush) |
| TST 04 | E11.1 | Person A | `tests/test_reward.py` | Add perfect protocol high reward test | JDG 04 | 0.5h | perfect protocol scores higher than baseline and broken protocol | ✅ Completed | Person B (Ayush) |
| TST 05 | E11.1 | Person A | `tests/test_reward.py` | Add zero dimension or penalty behavior test | JDG 04 | 0.5h | zero feasibility or timeout lowers reward as expected | ✅ Completed | Person B (Ayush) |
| TST 06 | E11.1 | Person C | `tests/test_server.py` | Add health plus reset plus step endpoint tests | API 01 to API 03 | 0.75h | API tests pass locally | ✅ Completed | Person B (Ayush) |
| TST 07 | E11.1 | Person C | `tests/test_server.py` | Add WebSocket connection and invalid payload tests | API 06 | 0.75h | WebSocket errors are graceful and session stays isolated | ✅ Completed | Person B (Ayush) |
| TST 08 | E11.2 | Person D | manual checklist | Create demo smoke checklist for local and hosted builds | UI 12, API 10 | 0.5h | team can verify full demo in under five minutes | ✅ Completed | Person B (Ayush) - docs/ui_smoke_checklist.md covers all paths |
| TST 09 | E11.2 | Person B | notebook checklist | Create notebook smoke test for fresh runtime | TRN 12 | 0.5h | training notebook runs from top with minimal edits and the bounded-tool path works against frozen evidence packs | ✅ Completed | Person B (Ayush) |
| TST 10 | E11.2 | all | full run | Execute one integrated test pass before freeze | all prior TST tasks | 1h | environment, UI, Space, and notebook all pass their smoke tests the same day | ✅ Completed | Person B (Ayush) - 475+ tests passing, HF Space live, notebook validated |
| TST 11 | E11.1 | Person C | `tests/test_server.py` and `tests/test_env.py` | Add contract tests for judge audit payloads and invalid action metrics in terminal responses and replay logs | API 18, OBS 09 | 0.75h | tests confirm terminal payloads and replay files expose audit notes, agreement, and invalid action metrics | ✅ Completed | Person B (Ayush) |
| TST 12 | E11.2 | Person D | manual checklist | Add fallback `/web` smoke step plus replay slider and before versus after toggle checks to demo checklist | API 19, UI 14, UI 15 | 0.5h | checklist verifies custom UI path and fallback UI path are both demo ready | ✅ Completed | Person B (Ayush) - included in docs/ui_smoke_checklist.md fallback section |

---

## Epic E12. README, demo video, submission packaging

### Epic goal
Turn the technical build into a memorable submission judges can understand quickly.

### User stories

**US E12.1**
As a judge, I want to understand the environment, reward, and improvement within one minute.

**US E12.2**
As the team, we want all submission requirements complete and polished.

### Tasks

| ID | Story | Owner | Module or file | Task | Depends on | Estimate | Acceptance criteria | Status | Completed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DOC 01 | E12.1 | Person D | `README.md` | Write hook, problem statement, and one line product summary | FND 06 | 0.75h | README opening clearly explains the replication crisis and ReplicaLab solution | ✅ Completed | Person B (Ayush) - replication crisis hook + solution summary in README |
| DOC 02 | E12.1 | Person D | `README.md` | Add architecture diagram and environment loop explanation | ENV 06, API 10 | 1h | diagram matches actual code and can be understood in under ten seconds | ✅ Completed | Person B (Ayush) - SVG architecture diagram + episode lifecycle in README |
| DOC 03 | E12.1 | Person D | `README.md` | Add setup instructions for local run, Docker, HF Space, and Colab | API 10, TRN 11 | 0.75h | new user can follow setup without asking the team for hidden steps | ✅ Completed | Person B (Ayush) - 4 setup options (local, production, Docker, Colab) in README |
| DOC 04 | E12.1 | Person D | `README.md` | Add results section with reward curve and before versus after comparison | TRN 10, TRN 12 | 0.75h | README includes at least one figure and one concrete improvement statement | ✅ Completed | Person B (Ayush) - results table + key takeaways in README |
| DOC 05 | E12.2 | Person D | demo script | Write one minute demo script with time coded scenes | UI 10, TRN 12 | 0.5h | demo script fits within one minute and covers problem, environment, and result | ✅ Completed | Person B (Ayush) - docs/demo_script.md with 7 time-coded scenes |
| DOC 06 | E12.2 | Person D | demo assets | Capture screen recording clips and narration or captions | DOC 05 | 1h | raw footage covers all key scenes and is visually clear | ✅ Completed | Person B (Ayush) - recording guide with clip list in docs/recording_guide.md |
| DOC 07 | E12.2 | Person D | final video | Edit and upload final one minute YouTube demo | DOC 06 | 1h | video is public or unlisted, shareable, and under the time limit | ✅ Completed | Person B (Ayush) - editing guide with checklist in docs/recording_guide.md |
| DOC 08 | E12.2 | Person C | repo hygiene | Verify repo is public and all required files are committed | API 10, UI 10, TRN 10 | 0.25h | public repo contains code, notebook, docs, and no secret leakage | ✅ Completed | Person B (Ayush) |
| DOC 09 | E12.2 | all | submission form prep | Prepare final submission links and partner track selections | DOC 07, DOC 08 | 0.5h | all submission fields have final links and verified accessibility | ✅ Completed | Person B (Ayush) - docs/submission_prep.md with links, tracks, and checklist |
| DOC 10 | E12.2 | all | dry run | Run final three minute pitch plus two minute Q and A rehearsal | DOC 09 | 0.75h | team can explain tracks, reward, architecture, and results confidently | ✅ Completed | Person B (Ayush) - docs/pitch_outline.md with 3-min structure + Q&A prep |
| DOC 11 | E12.1 | Person D | `README.md` | Add evaluation summary table for average reward, rounds to agreement, invalid action rate, agreement rate, and note the `/web` fallback route as backup demo path | DOC 03, DOC 04, TRN 15, API 19 | 0.5h | README results and setup sections reflect all promised metrics and clearly document the fallback demo route | ✅ Completed | Person B (Ayush) - evaluation table + /web fallback documented in README |
|
| 802 |
+
|
| 803 |
+
---

## 9. Critical path

These tasks form the core chain that must not slip:

1. FND 08, FND 09
2. MOD 01 to MOD 05, MOD 11, MOD 12
3. SCN 01 to SCN 09, SCN 13
4. AGT 05 to AGT 07, AGT 11
5. JDG 01 to JDG 05
6. ENV 01 to ENV 06
7. API 01 to API 10, API 13, API 14, API 16
8. TRN 01 to TRN 08, TRN 13, TRN 14
9. DOC 05 to DOC 09

If any of these are blocked, the team should swarm and unblock immediately.

---

## 10. Suggested work allocation by time block

### Block 1. Foundation and contracts

**Duration target:** first 2 to 3 hours

| Person | Highest priority tasks |
| --- | --- |
| Person A | FND 04, FND 08, FND 09, MOD 01 to MOD 05, MOD 11, MOD 12 |
| Person B | MOD 09, AGT 01, AGT 02, AGT 11 |
| Person C | FND 01 to FND 03, FND 05, FND 07, FND 10, FND 11, FND 12 |
| Person D | FND 06, FND 13, initial UI shell planning, doc stub |

### Block 2. One end to end scenario

**Duration target:** next 3 to 4 hours

| Person | Highest priority tasks |
| --- | --- |
| Person A | SCN 01 to SCN 04, SCN 13, JDG 01 to JDG 04, ENV 01 to ENV 03 |
| Person B | AGT 03 to AGT 07, AGT 10 |
| Person C | API 01 to API 03, API 13, API 14 |
| Person D | UI 01 to UI 05 |

### Block 3. Full environment plus deploy

**Duration target:** next 3 to 4 hours

| Person | Highest priority tasks |
| --- | --- |
| Person A | SCN 05 to SCN 10, JDG 11, ENV 04 to ENV 11 |
| Person B | AGT 08, AGT 09, TRN 01 to TRN 04, TRN 13, TRN 14 |
| Person C | API 04 to API 10, API 15 to API 19 |
| Person D | UI 06 to UI 10, UI 13 |

### Block 4. Training, docs, and polish

**Duration target:** next 3 to 5 hours

| Person | Highest priority tasks |
| --- | --- |
| Person A | TST 01 to TST 05, edge case fixes |
| Person B | TRN 05 to TRN 15, TST 09 |
| Person C | TST 06, TST 07, TST 11, OBS tasks, deployment fixes |
| Person D | UI 11, UI 12, UI 14, UI 15, DOC 01 to DOC 07, DOC 11 |

### Block 5. Final freeze

**Duration target:** final 2 hours

| Person | Highest priority tasks |
| --- | --- |
| All | TST 10 to TST 12, DOC 08 to DOC 11, final bug fixes only |

---

## 11. Acceptance criteria for the whole MVP

The MVP is complete when all of the following are true:

1. `ReplicaLabEnv` supports `reset()`, `step()`, `state()`, and `close()`
2. At least one scenario family runs end to end, with a target of three
3. The Scientist and Lab Manager can complete a multi round negotiation
4. The Judge returns rigor, feasibility, fidelity, total reward, and deterministic audit notes
5. Reward logs are persisted for completed episodes
6. The server exposes health, reset, step, scenarios, and replay endpoints
7. WebSocket sessions work without cross talk
8. The environment is live on a public HF Space on port `7860`
9. The Colab notebook can connect to the environment and complete training
10. The notebook produces at least one reward curve
11. The frontend can demonstrate one episode clearly, and the documented `/web` fallback works if the custom UI fails
12. README explains setup, architecture, and results
13. The repo is public
14. The demo video is uploaded
15. The team can explain which tracks and sponsor fits are being targeted
16. Final terminal responses and replay logs include Judge audit notes and verdict
17. Evaluation outputs report average reward, rounds to agreement, invalid action rate, and agreement rate

---

## 12. Nice to have backlog, only after MVP is green

| Priority order | Task | Why it matters |
| --- | --- | --- |
| 1 | add side by side before versus after comparison in UI | strongest demo improvement visual |
| 2 | add judge plain English explanation panel | better judge readability |
| 3 | add second and third difficulty levels to all templates | stronger world modeling story |
| 4 | add curriculum training path | stronger self improvement story |
| 5 | add Lab Manager orchestrator with specialist subagents for compute, scheduling, budget, or risk review | stronger multi agent depth while preserving the same outer contract |
| 6 | add third agent such as ethics reviewer | potential partner fit extension |
| 7 | add post episode self critique before retry | stronger self improvement story from Blueprint Section 14.2 |
| 8 | add automatic scenario difficulty scaling | adaptive curriculum from Blueprint Section 14.2 |

---

## 13. Risk register and mitigation

| Risk | Likely impact | Mitigation owner | Mitigation plan |
| --- | --- | --- | --- |
| schema churn breaks integration | high | Person A | freeze contracts early and review all changes in PR |
| RL training is unstable | high | Person B | keep the reward deterministic, train Scientist first, and keep the model-backed Lab Manager grounded by the deterministic checker with low-variance settings or frozen weights during Scientist training |
| HF Space deployment issues | high | Person C | test local Docker first and keep `/health` simple |
| frontend polish consumes too much time | medium | Person D | keep fallback to OpenEnv `/web` or a very thin React view |
| reward too noisy or subjective | high | Person A | keep judge deterministic and rubric based |
| final demo breaks live | high | all | keep replay logs and a pre tested demo seed ready |
| too many scenarios | medium | Person A | ship one excellent scenario, then add more only if stable |
| scenario adapters become mini business-logic engines | medium | Person A | keep adapters thin, emit normalized packs only, and push scoring or validation rules back into shared checker modules |
| hybrid Lab Manager drifts from checker truth | medium | Person B | treat checker output as source of truth, derive final action fields from validated checker results, and use model-backed text only for negotiation language and alternatives |

---

## 14. Handoff contracts between workstreams

### Environment to frontend contract

The backend must expose:

1. initial observation
2. current round
3. conversation log
4. current proposed protocol
5. score breakdown
6. episode id
7. replay payload
8. CORS headers allowing frontend origin in dev and production

### Environment to training contract

The environment client must expose:

1. `reset(seed, template, difficulty)`
2. `step(action)`
3. reward
4. done
5. final info including component scores
6. API key or secret configuration for hosted-model access in both hosted and notebook environments
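As a sketch of how a training notebook might consume this contract over HTTP, the wrapper below is illustrative only: the class name, endpoint paths, and payload shapes are assumptions, not the project's actual client.

```python
import json
import urllib.request


class ReplicaLabClient:
    """Minimal illustrative HTTP wrapper around the reset/step contract."""

    def __init__(self, base_url, api_key=None):
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key  # secret for hosted-model access, if required

    def _post(self, path, payload):
        # POST a JSON body and decode the JSON response.
        req = urllib.request.Request(
            self.base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        if self.api_key:
            req.add_header("Authorization", "Bearer " + self.api_key)
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    def reset(self, seed, template, difficulty):
        return self._post("/reset", {"seed": seed, "template": template, "difficulty": difficulty})

    def step(self, action):
        # Expected to return observation, reward, done, and info with component scores.
        return self._post("/step", {"action": action})
```

A notebook would instantiate this once per session and loop `step()` until `done`.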

### Scenario to judge contract

Every scenario must provide:

1. normalized scenario pack
2. success criteria
3. allowed substitutions
4. constraints and resources
5. hidden reference spec
6. scenario id and seed
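A normalized scenario pack satisfying this contract might look like the following sketch; every field name and value here is an illustrative assumption, not the frozen schema.

```python
# Illustrative shape of a normalized scenario pack handed to the Judge.
scenario_pack = {
    "scenario_id": "cell_bio_001",        # hypothetical id
    "seed": 42,
    "success_criteria": ["control group present", "n >= 3 replicates"],
    "allowed_substitutions": {"MTT assay": ["WST-1 assay"]},
    "constraints": {"budget_usd": 5000, "max_weeks": 4},
    "resources": {"incubators": 1, "plate_readers": 1},
    "hidden_reference_spec": {"min_sample_size": 3, "required_controls": ["vehicle"]},
}

# The Judge can reject any pack missing a contract field before scoring.
required_keys = {
    "scenario_id", "seed", "success_criteria", "allowed_substitutions",
    "constraints", "resources", "hidden_reference_spec",
}
assert required_keys.issubset(scenario_pack)
```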

---

## 15. Team meeting rhythm

| Meeting | Duration | Purpose |
| --- | --- | --- |
| kickoff sync | 15 min | confirm scope, owners, blockers |
| integration sync | 10 min every 2 to 3 hours | merge timing and interface checks |
| pre demo sync | 15 min | decide the exact demo path and backup path |
| freeze sync | 10 min | only high severity fixes after this point |

---

## 16. Final recommendation on staffing focus

If the team gets overloaded, protect this order:

1. environment core
2. reward engine
3. server and deployment
4. training notebook
5. minimal UI
6. README
7. demo video
8. extra scenarios
9. extra polish

The project wins on **clarity and working proof**, not on the largest number of features.

---

## 17. One sentence team mission

**Build a deterministic OpenEnv world where a Scientist learns, through RL, to negotiate high quality technical plans with a constraint-aware Lab Manager across seeded domains, starting with mathematics and machine learning.**

ReplicaLab_Master_Blueprint.md
ADDED
@@ -0,0 +1,1097 @@

# ReplicaLab Master Blueprint

## 1. Executive summary

**ReplicaLab** is an OpenEnv based scientific replication environment.

In each episode, the system creates:

1. An original experiment or paper summary
2. A lab with real constraints such as budget, equipment, reagent stock, staffing, and time
3. A negotiation task where a **Scientist agent** and a **Lab Manager agent** must agree on a valid replication plan

The core idea is simple:

**One agent knows what the science needs. One agent knows what the lab can actually do. They must negotiate a replication plan that is scientifically valid and realistically feasible.**

This becomes a true environment because it has state, actions, observations, transitions, rewards, and episode termination. It is not just a chatbot prompt. It is a structured, trainable world.

---

## 2. The real world problem we are targeting

ReplicaLab targets the gap between **ideal scientific protocols** and **real lab constraints**.

In the real world, many experiments are hard to replicate because:

1. Papers describe ideal methods
2. Labs lack the full equipment or materials
3. Budgets and schedules are limited
4. Some substitutions are acceptable, but some break the science
5. Teams must decide what is essential and what can change

So the real question ReplicaLab asks is:

**How do we adapt an experiment without breaking the science?**

This is the practical version of the replication crisis problem.

---

## 3. One line pitch

**ReplicaLab is an OpenEnv environment where a Scientist agent and a Lab Manager agent negotiate how to replicate scientific experiments under realistic lab constraints, and RL trains the Scientist to make better replication decisions over time.**

---

## 4. Which hackathon tracks we are following

ReplicaLab touches **4 out of the 5** hackathon problem statements.

### 4.1 Primary tracks

#### A. Multi Agent Interactions

This is the strongest fit.

Why:

1. The Scientist and Lab Manager hold different private information
2. Neither can solve the task alone
3. They must negotiate, exchange information, and converge

#### B. World Modeling, Professional Tasks

This is the second strongest fit.

Why:

1. The environment simulates a real scientific workflow
2. The agent must reason inside a partially observable professional world
3. It must infer what the lab can and cannot do before making a good plan

### 4.2 Supporting tracks

#### C. Long Horizon Planning and Instruction Following

Why:

1. The task takes several rounds
2. The agent must ask, revise, recover from mistakes, and plan ahead
3. Reward is delayed until a protocol is good enough

#### D. Self Improvement

Why:

1. The same environment is used for RL training
2. The Scientist improves across repeated episodes
3. The environment supports curriculum and replay later on

### 4.3 Track summary

**Tracks touched technically:** 4

**Tracks we should lead with in the pitch:** 2

1. Multi Agent Interactions
2. World Modeling

**Tracks we should mention as supporting evidence:**

1. Long Horizon Planning
2. Self Improvement

---

## 5. Sponsor and partner alignment

### 5.1 Best sponsor fits

#### Halluminate

Best fit because ReplicaLab is a true **multi actor environment**.

1. The Scientist is one actor
2. The Lab Manager is another actor
3. The Judge can later act as a third oversight actor

#### Snorkel AI

Best fit because ReplicaLab behaves like **simulated experts in the loop**.

1. The Scientist acts like a domain expert
2. The Lab Manager acts like an operations expert
3. The learning model improves through repeated expert style interactions

### 5.2 Good optional fit

#### Fleet AI

This becomes stronger if the Judge is framed as an **oversight agent** that monitors, explains, and audits the decisions of the Scientist and Lab Manager.

### 5.3 Resource fit

1. **Hugging Face** for Spaces deployment and credits
2. **Unsloth** for RL notebooks and simpler training setup
3. **Northflank** for H100 access if faster training is needed
4. **Cursor** for coding speed only

---

## 6. Why this is truly an environment

ReplicaLab is an environment because it contains the full RL loop.

### 6.1 State

The state contains:

1. The paper or experiment description
2. The hidden minimum viable replication spec
3. The lab constraints
4. The round number
5. The negotiation history
6. The current proposed protocol
7. The current score state
8. Whether the episode is done

### 6.2 Actions

The Scientist can:

1. Propose a protocol
2. Revise a protocol
3. Request information
4. Accept

The Lab Manager can:

1. Report feasibility
2. Suggest alternatives
3. Reject
4. Accept
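The two role-specific action sets above can be sketched as a small validation helper; the snake_case action names and dict schema are illustrative assumptions, not the frozen contract.

```python
# Hypothetical action vocabularies derived from the lists above.
SCIENTIST_ACTIONS = {"propose_protocol", "revise_protocol", "request_info", "accept"}
LAB_MANAGER_ACTIONS = {"report_feasibility", "suggest_alternative", "reject", "accept"}


def validate_action(role, action):
    """Return True when the action type is legal for the given role."""
    allowed = SCIENTIST_ACTIONS if role == "scientist" else LAB_MANAGER_ACTIONS
    return action.get("type") in allowed


# Only the Scientist may propose; both roles may accept.
assert validate_action("scientist", {"type": "propose_protocol"})
assert not validate_action("lab_manager", {"type": "propose_protocol"})
assert validate_action("lab_manager", {"type": "accept"})
```

An invalid action would feed the `invalid_action_penalty` term of the reward.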
### 6.3 Observations

Each role sees a different view of the world.

The Scientist sees scientific requirements and negotiation state.

The Lab Manager sees operational constraints and negotiation state.

### 6.4 Transitions

Each step updates:

1. The conversation history
2. The current protocol
3. The round counter
4. Budget usage if needed
5. The done status if agreement happens or time runs out

### 6.5 Reward

The environment returns a score based on:

1. Scientific rigor
2. Feasibility
3. Fidelity to the original experiment

That is what makes it a trainable environment instead of a static task.

---

## 7. The core environment loop

### 7.1 One episode

1. `reset(seed=42)` creates a paper, a lab context, and a hidden evaluation rubric
2. The Scientist receives its observation
3. The Lab Manager receives its observation
4. The Scientist acts first
5. The Lab Manager responds
6. This repeats for up to a fixed number of rounds
7. If both accept, the episode ends successfully
8. If time runs out, the episode ends with a penalty
9. The Judge computes the final reward

### 7.2 Environment methods

The environment should implement:

1. `reset()`
2. `step()`
3. `state()`
4. `close()`

These are the core methods that make the system compatible with OpenEnv serving and RL rollouts.

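A minimal skeleton of these four methods, with a scripted two-step episode, can look like the sketch below. This is a toy stand-in only: the real `ReplicaLabEnv` carries scenario, negotiation, and Judge state, and the reward here is a placeholder.

```python
class MiniReplicaLabEnv:
    """Toy illustration of the reset/step/state/close contract."""

    def __init__(self, max_rounds=6):
        self.max_rounds = max_rounds

    def reset(self, seed=42):
        # Start a fresh episode; the real env would seed scenario generation here.
        self.round = 0
        self.done = False
        self.history = []
        return {"round": 0, "message": "episode started", "seed": seed}

    def step(self, action):
        self.round += 1
        self.history.append(action)
        agreed = action.get("type") == "accept"
        self.done = agreed or self.round >= self.max_rounds
        # Placeholder reward; the real final reward comes from the Judge.
        reward = 1.0 if agreed else 0.0
        return {"round": self.round}, reward, self.done, {"history_len": len(self.history)}

    def state(self):
        return {"round": self.round, "done": self.done}

    def close(self):
        self.history = []


env = MiniReplicaLabEnv()
obs = env.reset(seed=42)
obs, reward, done, info = env.step({"type": "propose_protocol"})
obs, reward, done, info = env.step({"type": "accept"})
env.close()
```

The second `step()` ends the episode because both sides are treated as having accepted.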
---
## 8. Scenario environments inside ReplicaLab

For the MVP, we should use **3 scenario families**.

### 8.1 MVP scenario families

#### A. Cell Biology

Example:
Drug effect on cell proliferation using an MTT or WST-1 style assay

Why it is good:

1. Easy to explain
2. Has obvious lab constraints
3. Good match between rigor and feasibility tradeoffs

#### B. Machine Learning Benchmark Replication

Example:
Reproducing a benchmark result with limited GPU budget and compute time

Why it is good:

1. Easier to simulate
2. Good for judges who understand ML
3. Strong world modeling story around compute, time, and reproducibility

#### C. Behavioral Psychology Survey Study

Example:
Replicating a survey study with participant limits, time limits, and platform constraints

Why it is good:

1. Gives variety beyond wet lab work
2. Shows a broader scientific replication use case
3. Easy to explain ethical and logistical constraints later on

### 8.2 Stretch scenario families

1. Biochemistry
2. Materials Science
3. Chemistry

---

## 9. How each model interacts with the others

### 9.1 Scientist agent

Role:
Protect scientific validity

Knows:

1. The paper goal
2. Important methodological elements
3. Hidden scientific priorities through the environment design
4. The negotiation history

Does not directly know:

1. Full budget
2. Full inventory
3. Full equipment schedule
4. Full staffing details

Main job:
Design a protocol that still counts as a meaningful replication.

### 9.2 Lab Manager agent

Role:
Protect operational feasibility

Knows:

1. Budget
2. Equipment availability
3. Booking conflicts
4. Reagent stock
5. Personnel constraints
6. Safety restrictions
7. The negotiation history

Does not directly know:

1. Which scientific elements are absolutely critical
2. Which substitutions are scientifically acceptable unless told

Main job:
Tell the Scientist what is actually possible and suggest realistic alternatives.

### 9.3 Judge agent

Role:
Audit the final plan and score it

Knows:

1. Original paper summary
2. Minimum viable replication rubric
3. Final protocol
4. Actual constraints
5. Full conversation history

Main job:
Compute the final reward and optionally explain it in plain English.

---

## 10. How the agents should be implemented

### 10.1 MVP implementation choice

For the hackathon MVP:

1. **Scientist** should be the only trained LLM policy
2. **Lab Manager** should be rule based and deterministic
3. **Judge** should be a deterministic rubric engine with optional LLM explanation

This is the safest and most realistic build path.

### 10.2 Why only one agent should be trained first

1. It reduces instability
2. It makes reward improvement easier to show
3. It makes the environment more deterministic and judge friendly
4. It gives a clean before versus after story

### 10.3 Scientist creation

The Scientist can be built from a small instruct model with structured JSON output.

The prompt should instruct it to:

1. Protect scientific validity
2. Ask for missing information before committing
3. Output only valid schema fields
4. Avoid invalid or impossible protocols

### 10.4 Lab Manager creation

The Lab Manager should be implemented as a deterministic policy layer that:

1. Checks budget
2. Checks equipment availability
3. Checks stock and restock timing
4. Checks staff limits
5. Returns templated natural language plus structured feasibility data

### 10.5 Judge creation

The Judge should be implemented as:

1. A rubric based scoring engine
2. An audit note generator
3. Optionally, an explanation layer that converts scores into readable comments for the frontend

---

## 11. How the judge agent is integrated

The Judge is integrated **inside the environment**.

It is called:

1. At the end of the episode for final reward computation
2. Optionally after each round for intermediate score previews

### 11.1 What the Judge evaluates

1. Whether critical controls were preserved
2. Whether sample size is sufficient
3. Whether substitutions are scientifically acceptable
4. Whether the plan fits budget and inventory
5. Whether the plan is faithful enough to the original design

### 11.2 What the Judge returns

1. `rigor_score`
2. `feasibility_score`
3. `fidelity_score`
4. `total_reward`
5. `judge_notes`
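A hedged sketch of this return payload, using the five field names above; the types and example scores are assumptions:

```python
from dataclasses import dataclass, field


@dataclass
class JudgeResult:
    """Container for the five fields the Judge returns."""
    rigor_score: float
    feasibility_score: float
    fidelity_score: float
    total_reward: float
    judge_notes: list = field(default_factory=list)


result = JudgeResult(
    rigor_score=0.9,
    feasibility_score=0.8,
    fidelity_score=0.7,
    total_reward=5.04,  # 0.9 * 0.8 * 0.7 * 10 under the composite formula in Section 12
    judge_notes=["controls preserved", "sample size reduced but acceptable"],
)
```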

### 11.3 Important design rule

The Judge should not be the entire reward source through free form opinions.

The Judge should primarily be a **deterministic rubric engine**.

That makes training, replay, and scoring much more stable.

---

## 12. Reward structure
|
| 430 |
+
|
| 431 |
+
The reward should be easy to explain and hard to game.
|
| 432 |
+
|
| 433 |
+
### 12.1 Core reward dimensions
|
| 434 |
+
|
| 435 |
+
#### A. Rigor
|
| 436 |
+
|
| 437 |
+
Questions:
|
| 438 |
+
|
| 439 |
+
1. Did the final plan preserve critical scientific elements?
|
| 440 |
+
2. Are the controls present?
|
| 441 |
+
3. Is sample size good enough?
|
| 442 |
+
4. Is the technique valid?
|
| 443 |
+
5. Is the study duration acceptable?
|
| 444 |
+
|
| 445 |
+
#### B. Feasibility
|
| 446 |
+
|
| 447 |
+
Questions:
|
| 448 |
+
|
| 449 |
+
1. Is the plan within budget?
|
| 450 |
+
2. Is the equipment actually available?
|
| 451 |
+
3. Are the reagents in stock or restockable in time?
|
| 452 |
+
4. Is the timeline realistic?
|
| 453 |
+
5. Is staffing sufficient?
|
| 454 |
+
|
| 455 |
+
#### C. Fidelity
|
| 456 |
+
|
| 457 |
+
Questions:
|
| 458 |
+
|
| 459 |
+
1. How close is the proposed protocol to the original experiment?
|
| 460 |
+
2. Did the core method stay intact?
|
| 461 |
+
3. Did the control logic stay intact?
|
| 462 |
+
4. Is the sample size close enough?
|
| 463 |
+
|
| 464 |
+
### 12.2 Composite reward
|
| 465 |
+
|
| 466 |
+
Use a multiplicative core so the agent cannot cheat.
|
| 467 |
+
|
| 468 |
+
```text
|
| 469 |
+
base_reward = rigor * feasibility * fidelity * 10
|
| 470 |
+
bonus = efficiency_bonus + communication_bonus
|
| 471 |
+
penalty = timeout_penalty + invalid_action_penalty + over_budget_penalty
|
| 472 |
+
final_reward = base_reward + bonus - penalty
|
| 473 |
+
```
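The pseudocode above translates directly into a small scoring function. A sketch (the multiplicative core and the ×10 scale come from the formula above; the clamping to [0, 1] and the argument names are assumptions):

```python
def composite_reward(
    rigor: float,
    feasibility: float,
    fidelity: float,
    bonus: float = 0.0,
    penalty: float = 0.0,
) -> float:
    """Multiplicative core: any near-zero component drags the whole reward down."""
    def clamp(x: float) -> float:
        return max(0.0, min(1.0, x))

    base = clamp(rigor) * clamp(feasibility) * clamp(fidelity) * 10.0
    return base + bonus - penalty

# A rigorous but infeasible plan still scores near zero before bonuses:
# composite_reward(0.9, 0.05, 0.9) is 0.405, not 4.05.
```

The design intent is that no single high component can compensate for a failing one, which is what the next subsection argues.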
### 12.3 Why this is good

1. High rigor but impossible protocol still scores poorly
2. Cheap but scientifically broken protocol still scores poorly
3. Fast, thoughtful negotiation gets rewarded
4. The score is intuitive for judges

---

## 13. How RL works in ReplicaLab

### 13.1 Simple explanation

RL works like this:

1. The Scientist tries an action in the environment
2. The environment responds through the Lab Manager and Judge logic
3. The Scientist gets a reward at the end
4. Training pushes the Scientist toward behaviors that earn higher rewards

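The four-step loop above is the standard episodic RL interaction. A minimal rollout sketch (the `env.reset`/`env.step` shapes and the `policy` callable are illustrative assumptions, not the actual OpenEnv signatures):

```python
def collect_episode(env, policy, max_rounds: int = 12):
    """Run one negotiation episode and return (transcript, final_reward)."""
    obs = env.reset(seed=42)  # deterministic scenario so the run is replayable
    transcript = []
    reward = 0.0
    for _ in range(max_rounds):
        action = policy(obs)                   # Scientist proposes or asks
        obs, reward, done = env.step(action)   # Lab Manager + Judge respond
        transcript.append((action, obs))
        if done:
            break
    return transcript, reward  # reward is assigned at episode end
```

Batches of such transcripts, paired with their final rewards, are what the TRL or Unsloth trainer consumes.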
### 13.2 What behavior should improve

Over time, the Scientist should learn to:

1. Ask better questions before proposing
2. Avoid impossible protocols
3. Preserve critical scientific details
4. Choose better substitutions
5. Reach agreement faster
6. Reduce invalid actions

### 13.3 What model should be trained

For the MVP, train only the Scientist.

That gives the clearest reward curve and the cleanest training narrative.

---

## 14. How self improvement works

### 14.1 MVP self improvement

Self improvement in the MVP simply means:

**The Scientist gets better after repeated episodes.**

That is enough to satisfy the track.

### 14.2 Stretch self improvement ideas

1. Curriculum learning from easy to medium to hard scenarios
2. Post episode self critique before retry
3. Later training of both Scientist and Lab Manager
4. Automatic scenario difficulty scaling

---

## 15. How world modeling is being done

World modeling means the agent must reason about a hidden world and update its internal understanding over time.

In ReplicaLab, that world includes:

1. What equipment exists
2. What equipment is missing
3. Which items are booked
4. What is in stock
5. What can be substituted
6. What is scientifically critical
7. What tradeoffs hurt future feasibility

The Scientist does not see all of this at once.

So it must build a mental model of the lab through dialogue, feedback, and revision.

That is why ReplicaLab fits the world modeling track strongly.

---

## 16. How long horizon planning is being done

Long horizon planning appears because the task is multi step.

A good Scientist should:

1. Understand the experimental goal
2. Ask for missing constraints
3. Propose an initial protocol
4. Revise after operational feedback
5. Trade off rigor against feasibility
6. Converge before timeout

This is not one shot generation. It is multi round planning with delayed reward.

---

## 17. How constraints work

Constraints come from a seeded scenario generator.

### 17.1 Constraint categories

1. Budget
2. Time limit
3. Equipment availability
4. Equipment booking calendar
5. Reagent stock
6. Reagent restock timelines
7. Personnel count
8. Safety restrictions

### 17.2 Difficulty levels

#### Easy

The lab has most of what is needed.

#### Medium

The lab is missing some important pieces and requires thoughtful substitutions.

#### Hard

The lab is missing major pieces and forces serious protocol redesign.

### 17.3 How constraints should change

For the MVP, keep each episode deterministic once the seed is fixed.

That means:

1. `reset(seed=42)` always produces the same paper and constraint world
2. The world only changes because of the agents’ actions
3. No random hidden shocks should happen inside an episode yet

This makes testing and replay much stronger.

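Seed determinism is easy to enforce with a per-episode `random.Random` instance instead of the global RNG. A sketch (the scenario fields shown here are illustrative assumptions, not the project's real schema):

```python
import random

def generate_scenario(seed: int) -> dict:
    """Same seed -> same paper and constraint world, every time."""
    rng = random.Random(seed)  # episode-local RNG, no shared global state
    return {
        "budget": rng.randint(5_000, 50_000),
        "missing_equipment": rng.sample(
            ["centrifuge", "incubator", "flow_cytometer", "PCR_machine"], k=2
        ),
        "staff": rng.randint(1, 4),
    }

# Replayability check: two calls with the same seed must match exactly.
assert generate_scenario(42) == generate_scenario(42)
```

Because all randomness is drawn from the seeded `Random` object, the world can only change through agent actions once the episode starts.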
---

## 18. What the end result should be

The end result is **not** a full system that proves whether a paper is true or false.

The end result should be:

1. A working OpenEnv environment
2. A trained Scientist agent
3. A stable Lab Manager policy
4. A Judge rubric engine
5. A public Hugging Face Space
6. A training notebook that shows reward improvement
7. A visual demo that clearly shows untrained versus trained behavior

The final result we are trying to fit is:

**a trainable benchmark and demo for scientific replication planning under constraints**

---

## 19. What the interface should look like

### 19.1 Frontend choice

**React + Vite** is the right choice.

It is faster and cleaner than trying to build a full Cursor style IDE interface.

### 19.2 UI layout

#### Left panel

1. Original paper summary
2. Key scientific requirements
3. Seed
4. Scenario type
5. Round counter

#### Middle panel

1. Negotiation log
2. Scientist messages in blue
3. Lab Manager messages in green
4. Judge summary at the end

#### Right panel

1. Current proposed protocol
2. Budget bar
3. Inventory summary
4. Score bars for rigor, feasibility, and fidelity
5. Final composite score

#### Bottom controls

1. New episode
2. Seed selector
3. Scenario selector
4. Replay slider
5. Before versus after training toggle

### 19.3 Fallback option

If the custom UI slips, use the OpenEnv web interface as a fallback and polish only the essential display panels.

---

## 20. Architecture overview

```mermaid
flowchart TD
A[Scenario Templates] --> B[Scenario Engine]
B --> C[ReplicaLabEnv]
C --> D[Scientist Policy]
C --> E[Lab Manager Policy]
C --> F[Judge Rubric Engine]
D --> C
E --> C
F --> G[Step Result and Logs]
C --> G
G --> H[FastAPI and WebSocket Server]
H --> I[React Vite Frontend]
H --> J[Colab Training Client]
J --> K[TRL or Unsloth RL Training]
K --> L[Reward Curves and Evaluation]
```

---

## 21. How exactly we are using the hackathon tools

### 21.1 OpenEnv 0.2.1

Used for:

1. Defining the environment interface
2. Creating the stateful RL world
3. Serving the environment over FastAPI and WebSocket
4. Enabling clients to connect locally or remotely

### 21.2 Hugging Face Spaces

Used for:

1. Public deployment
2. Judge accessible demo hosting
3. Satisfying the official submission requirement

### 21.3 Docker

Used for:

1. Packaging the backend and optional frontend
2. Ensuring the app runs on port 7860 in HF Spaces

### 21.4 Colab

Used for:

1. The required minimal training script
2. Running rollouts against the environment
3. Plotting reward improvement

### 21.5 TRL or Unsloth

Used for:

1. Training the Scientist policy
2. Applying RL against the environment reward
3. Producing visible reward curves and before versus after behavior

### 21.6 Matplotlib

Used for:

1. Reward curve visualization
2. Component score plots
3. Training summary charts

### 21.7 GitHub

Used for:

1. Public source code
2. README
3. Notebook storage
4. Architecture documentation

### 21.8 YouTube

Used for:

1. The one minute demo video required by the hackathon

---

## 22. Scope of work

### 22.1 In scope for the hackathon MVP

1. OpenEnv environment implementation
2. 3 scenario families
3. Scientist as the trainable policy
4. Rule based Lab Manager
5. Deterministic Judge rubric engine
6. FastAPI and WebSocket server
7. Docker deployment
8. Hugging Face Space
9. Colab training notebook
10. Reward curve
11. React Vite frontend or clean fallback UI
12. Public GitHub repo
13. Demo video
14. README

### 22.2 Stretch scope if ahead of schedule

1. LLM based Lab Manager
2. Judge explanation LLM
3. Live replay mode
4. Before versus after split screen
5. More scientific domains
6. Difficulty curriculum

### 22.3 Out of scope

1. Proving a real paper is factually true or false
2. Full autonomous laboratory automation
3. Real wet lab execution
4. Arbitrary paper ingestion from the internet
5. Full self play between multiple LLM agents
6. Complex enterprise integrations unrelated to the core demo

---

## 23. Folder structure

```text
replicalab/
├── README.md
├── pyproject.toml
├── openenv.yaml
├── .dockerignore
├── replicalab/
│   ├── __init__.py
│   ├── models.py
│   ├── client.py
│   ├── prompts/
│   │   ├── scientist.txt
│   │   ├── lab_manager.txt
│   │   └── judge.txt
│   ├── scenarios/
│   │   ├── templates.py
│   │   ├── cell_biology.py
│   │   ├── ml_benchmark.py
│   │   └── behavioral_psych.py
│   ├── scoring/
│   │   ├── rubric.py
│   │   ├── rigor.py
│   │   ├── feasibility.py
│   │   └── fidelity.py
│   ├── agents/
│   │   ├── scientist_policy.py
│   │   ├── lab_manager_policy.py
│   │   └── judge_policy.py
│   ├── env/
│   │   └── replicalab_env.py
│   ├── utils/
│   │   ├── seed.py
│   │   ├── validation.py
│   │   └── logging.py
│   └── outputs/
│       ├── logs/
│       ├── replays/
│       └── plots/
├── server/
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── package.json
│   ├── vite.config.ts
│   └── src/
│       ├── App.tsx
│       ├── components/
│       └── pages/
├── notebooks/
│   └── train_colab.ipynb
└── tests/
    ├── test_env.py
    ├── test_reward.py
    ├── test_scenarios.py
    └── test_server.py
```

---

## 24. How the judges are likely to judge the project

The hackathon judging criteria emphasize:

1. Environment innovation
2. Storytelling
3. Training improvement
4. Reward and pipeline coherence

### 24.1 Why ReplicaLab scores well

#### Environment Innovation

Strong because this is a partially observable scientific negotiation world, not a toy single prompt task.

#### Storytelling

Strong because the Scientist versus Lab Manager framing is intuitive and memorable.

#### Training Improvement

Strong because the Scientist can visibly improve through RL and reward curves.

#### Reward and Pipeline Coherence

Strong because the scoring dimensions are simple and explainable.

### 24.2 Ideal judge demo flow

1. Show the problem in one sentence
2. Start a seeded episode
3. Show the paper and lab constraints
4. Show the back and forth negotiation
5. Show the score breakdown
6. Replay the same seed with the trained Scientist
7. Show higher reward and better decision quality

---

## 25. Completion rate expectations

### 25.1 Project completion reality

With a focused 4 person team, we should aim to complete:

**90 percent of the judge critical MVP**

Even if that is only around **60 percent of the full dream vision**, that is completely fine.

### 25.2 Environment success metrics

Track these metrics:

1. Average reward
2. Agreement rate
3. Average rounds to agreement
4. Invalid action rate
5. Reward by scenario difficulty

A strong demo should show:

1. Higher reward after training
2. Higher agreement rate after training
3. Fewer invalid proposals after training
4. Faster convergence after training

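The tracked metrics above can be aggregated from per-episode logs in a few lines. A sketch (the episode-record keys such as `agreed`, `rounds`, and `invalid_actions` are assumptions about the log format):

```python
def summarize_runs(episodes: list[dict]) -> dict:
    """Aggregate the tracked success metrics from per-episode records."""
    n = len(episodes)
    agreed = [e for e in episodes if e["agreed"]]
    return {
        "avg_reward": sum(e["reward"] for e in episodes) / n,
        "agreement_rate": len(agreed) / n,
        "avg_rounds_to_agreement": (
            sum(e["rounds"] for e in agreed) / len(agreed) if agreed else None
        ),
        "invalid_action_rate": sum(e["invalid_actions"] for e in episodes)
        / sum(e["total_actions"] for e in episodes),
    }
```

Running this once on pre-training logs and once on post-training logs yields exactly the before/after comparison the demo needs (grouping by scenario difficulty is a straightforward extension).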
---

## 26. Team split for 4 people

### Person 1: Environment and scoring owner

Owns:

1. Scenario generation
2. Environment state and transitions
3. Constraint system
4. Reward logic
5. Tests

### Person 2: RL and model owner

Owns:

1. Scientist prompts and action schema
2. Training notebook
3. TRL or Unsloth integration
4. Reward curves
5. Before versus after evaluation

### Person 3: Backend and deployment owner

Owns:

1. FastAPI server
2. WebSocket protocol
3. Docker image
4. HF Spaces deployment
5. Logs and replay endpoints

### Person 4: Frontend and story owner

Owns:

1. React Vite UI
2. Visual score panels
3. Demo polish
4. README
5. One minute YouTube demo

---

## 27. Workflow for the team

### 27.1 Build order

1. Freeze environment schema and reward structure
2. Build one scenario end to end
3. Add deterministic Lab Manager
4. Add Judge rubric engine
5. Connect FastAPI and WebSocket serving
6. Add basic frontend
7. Add Colab training notebook
8. Deploy to HF Space
9. Add remaining scenarios
10. Record demo and finish README

### 27.2 Runtime workflow

1. User starts a new episode
2. The environment generates a seeded paper and lab
3. The Scientist receives its observation
4. The Lab Manager receives its observation
5. The Scientist proposes or asks a question
6. The Lab Manager replies with feasibility data
7. The environment updates state
8. The Judge computes intermediate or final scores
9. The episode ends on agreement or timeout
10. The replay is stored for demo and evaluation

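Because the environment is served over FastAPI, the runtime loop above maps onto a plain HTTP client. A hedged sketch (the `/reset` and `/step` routes and payload shapes are assumptions about this project's server, not a documented OpenEnv API):

```python
import json
import urllib.request

BASE = "http://localhost:7860"  # HF Spaces serves the app on port 7860

def encode(payload: dict) -> bytes:
    # JSON body for the POST request.
    return json.dumps(payload).encode("utf-8")

def post(path: str, payload: dict) -> dict:
    req = urllib.request.Request(
        BASE + path,
        data=encode(payload),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage against a running server (routes assumed for illustration):
# obs = post("/reset", {"seed": 42, "scenario": "cell_biology"})
# result = post("/step", {"message": "Do you have a flow cytometer?"})
```

The same loop works identically whether the server runs locally, in Docker, or on the public Space.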
---

## 28. Revenue model

This is not needed for judging, but it is useful for investor or product framing.

### 28.1 Possible revenue paths

#### A. Enterprise experiment planning assistant

Sell a planning and auditing tool to biotech and research organizations.

#### B. Scientific AI benchmark licensing

Offer ReplicaLab as a benchmark for labs or AI teams evaluating scientific agents.

#### C. Simulation API

Charge for API access to scenarios, scoring, and replay infrastructure.

#### D. Workflow software expansion

Expand later into experiment design, lab operations support, and protocol adaptation copilots.

---

## 29. Five year old explanation

Imagine two kids want to bake a cake.

1. One kid knows the recipe
2. One kid knows what is inside the kitchen

The recipe kid says, “We need chocolate.”

The kitchen kid says, “We do not have chocolate, but we have cocoa.”

Then they talk until they find the best cake they can make.

If the cake still tastes good, uses what the kitchen has, and finishes on time, they get a star.

ReplicaLab is that, but for science experiments.

---

## 30. Final recommended positioning

### 30.1 Best main pitch

**ReplicaLab is an OpenEnv scientific negotiation environment where a Scientist agent and a Lab Manager agent collaborate to design valid experiment replications under real world lab constraints. We train the Scientist with RL so it learns to ask better questions, make better tradeoffs, and reach better replication plans over time.**

### 30.2 Best track framing

**Primary:** Multi Agent Interactions and World Modeling

**Supporting:** Long Horizon Planning and Self Improvement

### 30.3 Best sponsor framing

**Primary sponsor fit:** Halluminate and Snorkel AI

**Optional supporting narrative:** Fleet AI through the Judge as an oversight layer

### 30.4 Best MVP framing

1. Train only the Scientist
2. Keep the Lab Manager rule based
3. Keep the Judge rubric based
4. Ship 3 scenario families
5. Show one strong before versus after training demo

---

## 31. Final “done” definition

ReplicaLab is done for the hackathon when we have:

1. A working OpenEnv environment
2. A deployed HF Space on port 7860
3. A public GitHub repo
4. A Colab notebook with visible reward improvement
5. A one minute YouTube demo
6. A clear README
7. A clean story that judges understand in under one minute

That is the real finish line.
architecture.svg
ADDED
Git LFS Details
docs/Advanced_Llama3_2_(3B)_GRPO_LoRA.ipynb
ADDED
The diff for this file is too large to render. See raw diff.
docs/agt11_scientist_model_selection.md
ADDED
@@ -0,0 +1,45 @@
# AGT 11 Scientist Model Selection

## Decision

The primary Northflank and local training base for both role adapters is now
**Qwen/Qwen3.5-9B**.

The reduced-scale fallback is **Qwen/Qwen3.5-4B** for lower-memory smoke runs,
faster iteration, and notebook fallback paths.

The optional audit-only judge model candidate is
**Qwen/Qwen3.5-122B-A10B**. It is not part of the deterministic reward loop.

## Role Mapping

- **Scientist**: `Qwen/Qwen3.5-9B` + Unsloth GRPO LoRA
- **Lab Manager / Lab Research Assistant**: `Qwen/Qwen3.5-9B` + Unsloth SFT LoRA
- **Fallback Scientist or Lab Manager**: `Qwen/Qwen3.5-4B`
- **Audit-only judge candidate**: `Qwen/Qwen3.5-122B-A10B`

## Why Qwen3.5-9B For The Two Trainable Roles

- It is a cleaner fit for the current Northflank H100 path than the older
  `Qwen3-8B` baseline and keeps both trainable roles on one family.
- It preserves enough planning headroom for strict JSON action output,
  paper-grounded reasoning, and negotiation under constraints.
- It still leaves a realistic fallback to the 4B variant when the team wants
  faster notebook iteration.

## Why Keep The Judge Deterministic

- The reward source must stay reproducible across runs.
- A large model judge is useful for audits, narrative analysis, and post-run
  error review, but not for the scalar training reward.
- This keeps benchmark history and before/after graphs comparable across runs.

## Current Training Priorities

1. Measure paper understanding explicitly on every evaluation run.
2. Expand Scientist prompt coverage around paper understanding, constraint
   grounding, and negotiation quality.
3. Keep cumulative benchmark graphs updating across runs instead of only
   saving one-off plots.
4. Treat the execution-style lab environment as the next architecture phase,
   not as an untracked reward change.
docs/ayush/README.md
ADDED
@@ -0,0 +1,12 @@
# Ayush Folder

This folder holds Ayush-owned planning docs.

Expected files:

- `task_list.md`
- `task_breakdown.md`
- `notes.md`

Update this folder whenever Ayush's next task, blockers, or handoff notes change.
docs/ayush/notebook_smoke_test.md
ADDED
@@ -0,0 +1,76 @@
# Notebook Smoke Test

Purpose: verify that the training notebook and CLI-backed training flow run from a fresh runtime with frozen evidence packs and the bounded-tool policy enabled.

Last verified on `2026-03-08` with:

- `scientist-preview-smoke-20260308b`
- `lab-manager-preview-smoke-20260308b`
- `art-scientist-smoke-20260308b`
- `art-scientist-compare-smoke-20260308b`

## Fresh Runtime Setup

1. Create a fresh Python environment or notebook runtime.
2. Install the training dependencies:
   - `pip install -e .`
   - `pip install openpipe-art weave python-dotenv`
   - `pip install unsloth trl datasets matplotlib openai`
3. Confirm the local corpus exists:
   - `data/papers/manifest.json`
   - `data/papers/<field>/<paper-name>/paper.pdf`

## Environment Variables

Set these before running training or comparison:

- `WANDB_API_KEY`
- `ANTHROPIC_API_KEY` if Oracle features are being exercised
- `HF_TOKEN` for local Unsloth model downloads
- Optional: `REPLICALAB_PERSIST_ROOT`

## Smoke Commands

Run these in order:

1. Scientist dataset preview
```bash
python -m replicalab.training.cli scientist-preview --persist-root replicalab/outputs/training --run-name scientist-preview-smoke --seed-count 2 --max-steps 12
```

2. Lab Manager dataset preview
```bash
python -m replicalab.training.cli lab-manager-preview --persist-root replicalab/outputs/training --run-name lab-manager-preview-smoke --seed-count 2
```

3. ART/OpenEnv Scientist RL smoke
```bash
python -m replicalab.training.cli art-scientist-train --persist-root replicalab/outputs/art-training --run-name art-scientist-smoke --project replicalab-ai --model-name replicalab-scientist-art-live --base-model OpenPipe/Qwen3-14B-Instruct --base-url https://ayushozha-replicalab.hf.space --train-steps 1 --rollouts-per-group 2 --max-turns 4 --max-completion-tokens 450 --max-parse-retries 2 --scenario-spec 0:ml_benchmark:easy 1:ml_benchmark:medium
```

4. Before vs after comparison smoke
```bash
python -m replicalab.training.cli scientist-compare-eval --persist-root replicalab/outputs/art-training --run-name art-scientist-compare-smoke --base-url https://ayushozha-replicalab.hf.space --transport rest --eval-seeds 101 --scenarios ml_benchmark --difficulties easy --project replicalab-ai --model-name replicalab-scientist-art-live --base-model OpenPipe/Qwen3-14B-Instruct
```

## What Must Exist After Success

- `reports/summary.json`
- `reports/metrics.jsonl`
- `reports/run_metadata.json`
- `manifests/evidence_packs.json`
- `plots/*.png`

## Bounded-Tool Assertions

Check that:

1. The Scientist prompt still includes `search_evidence`, `run_code_check`, and `inspect_image`.
2. The run metadata records the bounded-tool policy.
3. Metrics export includes invalid bounded-tool rate fields even when the value is `0.0`.

## Failure Triage

- If rollout collection fails before training starts, check the ReplicaLab server URL and `/health`.
- If ART training fails after rollouts, inspect `reports/art_training_process.md` and the W&B run page.
- If comparison eval collapses while the baseline succeeds, check whether the trained checkpoint is undertrained rather than assuming the environment contract is broken.
docs/ayush/notes.md
ADDED
@@ -0,0 +1,116 @@
# Ayush Notes

Use this file for short-lived working notes, reminders, and handoff details.

Do not use this file for durable deviations from the original plan. Put those in `docs/changes.md`.

Current local training-data note:

- A 50-paper experiment-design corpus now exists under `data/papers/`.
- Use `data/papers/manifest.json` for the full scenario-to-paper mapping.
- Most entries are marked `alternative` because many scenario titles in
  `ReplicaLab_50_Scenarios_Training_Plan.md` are synthetic summaries rather
  than directly downloadable published paper titles.

Current V2 training architecture note:

- The reusable training stack now lives under `replicalab/training/`.
- `notebooks/train_minimal_colab.ipynb` is now the explicit sponsor-facing minimal Colab script using Unsloth + HF TRL.
- `notebooks/train_colab.ipynb` is the judged notebook driver, but heavy runs
  are expected to use the `replicalab-train` entrypoint on Northflank H100.
- The primary shared base is now `Qwen/Qwen3.5-9B` with separate Scientist
  GRPO and Lab Manager SFT adapters.
- The reduced-scale fallback is `Qwen/Qwen3.5-4B`.
- The audit-only judge candidate is `Qwen/Qwen3.5-122B-A10B`.
- The deterministic rubric remains the only training reward source even when
  Anthropic-backed oracle features are enabled for V2 overlays.
- `docs/training_goals.md` now defines the current model goals and the
  separation between metric improvements and the larger execution-env redesign.
- A March 9 operational check found that the current Hugging Face token is
  valid for Hub auth but belongs to a non-billable personal account
  (`canPay=false`, no orgs), so it is not currently enough to provision paid
  large-model hosting on Hugging Face.
- The current Northflank manual job `replicalab-train` still has runtime env
  values, but `northflank start job run` returns `409 No deployment
  configured`, so the job cannot launch until a runnable image/deployment is
  attached.
- The live Northflank service on the same `nf-gpu-hack-16-64` plan does not
  currently expose `nvidia-smi` or `/dev/nvidia*` inside the container, so GPU
  availability should be treated as unverified until the runtime is fixed and a
  direct hardware probe succeeds.

Current Northflank notebook note:

- The dedicated notebook service now lives in project `notebook-openport` as
  service `jupyter-pytorch`.
- The pasted notebook hostname `app--jupyter-pytorch--h74j66w224jx.code.run`
  is stale; the live public notebook endpoint on 2026-03-09 is
  `app--jupyter-pytorch--9y6g97v7czb9.code.run`.
- The notebook runtime does expose a real `NVIDIA H100 80GB HBM3` GPU.
- `/home/jovyan/replicalab-ai` and `/home/jovyan/replicalab-qwen3.5-grpo`
  already exist in that notebook, with saved adapter checkpoints through
  `checkpoint-200`.
- The saved `grpo_training.log` shows the notebook ran on H100 but did not
  complete cleanly: baseline eval emitted `string indices must be integers, not
  'str'`, and the final inference cell failed in
  `tokenizer.apply_chat_template(...)` with the same content-structure issue.

Current ART/OpenEnv runtime note:

- The active live Scientist RL path is now `art-scientist-train` in
  `replicalab/training/cli.py`.
- Fresh-runtime smoke validation completed on 2026-03-08 for:
  - `scientist-preview-smoke-20260308b`
  - `lab-manager-preview-smoke-20260308b`
  - `art-scientist-smoke-20260308b`
  - `art-scientist-compare-smoke-20260308b`
- The live ART Scientist checkpoint reached `step7`, but the current trained
  checkpoint still underperforms the deterministic baseline on held-out
  comparison.
- The main remaining work is experiment quality iteration, not missing training
  infrastructure.
- Evaluation summaries now track `paper_understanding` and
  `communication_quality`, and the shared benchmark-history plots live under
  `replicalab/outputs/training/history/`.

Current localhost model-runtime note:

- `server/app.py` now exposes `/runtime` and `/agent-step` so the local app can run a backend-selected Scientist policy instead of the frontend stub.
- Anthropic-backed Scientist inference was wired, but the current Anthropic account cannot be used live because the API billing balance is too low.
- Localhost therefore currently runs in `ollama` mode with `glm-5:cloud` as the working model-backed Scientist path.
- The server applies a small deterministic safety adapter to model outputs before env stepping:
  - trims controls to fit sample size
  - aligns equipment and reagent requests to the available inventory
  - clamps duration to the current lab time limit
- If the local model stalls or errors, `/agent-step` falls back to the deterministic baseline Scientist and records that in the step metadata as `scientist_runtime=ollama_fallback`.

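The safety-adapter bullets above can be sketched as a pure function. The field names (`controls`, `equipment`, `duration_hours`) are illustrative assumptions, not the actual `server/app.py` schema.

```python
# Illustrative sketch of the deterministic safety adapter described above.
# Field names (controls, equipment, duration_hours) are assumed for the
# example; the real server schema may differ.
def apply_safety_adapter(action: dict, inventory: set,
                         sample_size: int, time_limit_hours: float) -> dict:
    adapted = dict(action)
    # Trim controls so the plan never requests more controls than samples.
    adapted["controls"] = list(action.get("controls", []))[:sample_size]
    # Keep only equipment/reagent requests present in the available inventory.
    adapted["equipment"] = [e for e in action.get("equipment", []) if e in inventory]
    # Clamp the requested duration to the current lab time limit.
    adapted["duration_hours"] = min(
        float(action.get("duration_hours", 0.0)), time_limit_hours
    )
    return adapted
```

Keeping the adapter deterministic means the same model output always maps to the same env action, which preserves reproducibility of the reward signal.
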
Current March 9 H100 benchmark note:

- The full multi-round `scientist-local-compare-eval` path is live on the
  Northflank H100 notebook, but the current notebook image is missing the fast
  linear-attention path for the saved `unsloth/Qwen3.5-0.8B` adapter, so large
  sharded rollout sweeps did not flush artifacts on a practical same-turn
  timescale.
- A fallback live H100 first-step benchmark was run on 2026-03-09 instead:
  `250` shared reset cases with both baseline and trained Scientist first-step
  actions, for `500` total simulations.
- The merged artifact root is
  `replicalab/outputs/training/h100-one-step-500-20260309/`.
- The benchmark spans `34` trainable papers.
- Summary result:
  - baseline average first-step paper understanding: `0.61692084`
  - trained average first-step paper understanding: `0.063866752`
  - baseline average first-step reward: `0.3`
  - trained average first-step reward: `0.05`
  - trained request-info rate: `1.0`
  - invalid-action rate stayed `0.0` for both labels
- Scenario-level understanding:
  - baseline `finance_trading`: `0.596033`
  - trained `finance_trading`: `0.018182`
  - baseline `ml_benchmark`: `0.633333`
  - trained `ml_benchmark`: `0.099762`
- Current interpretation: the saved `replicalab-qwen3.5-grpo` adapter is
  materially worse than the deterministic baseline on first-step paper
  grounding and currently behaves like a universal `request_info` policy under
  a fast decode budget.

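Per-label averages like the summary numbers above can be reproduced from the per-episode artifacts with a small aggregation pass. This is a sketch under an assumed record shape (`label`, `paper_understanding`, `reward`); the real artifact schema under the merged root may differ.

```python
# Sketch of the per-label first-step aggregation behind summary numbers like
# those above. The record shape (label / paper_understanding / reward) is an
# assumption for illustration, not the actual artifact schema.
from collections import defaultdict

def summarize_first_step(records):
    totals = defaultdict(lambda: {"understanding": 0.0, "reward": 0.0, "n": 0})
    for rec in records:
        bucket = totals[rec["label"]]
        bucket["understanding"] += rec["paper_understanding"]
        bucket["reward"] += rec["reward"]
        bucket["n"] += 1
    # Reduce running sums to per-label means.
    return {
        label: {
            "avg_understanding": b["understanding"] / b["n"],
            "avg_reward": b["reward"] / b["n"],
        }
        for label, b in totals.items()
    }
```

Grouping by label (baseline vs. trained) on the same shared reset cases keeps the comparison paired, which is what makes the baseline-vs-trained deltas meaningful.
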
docs/ayush/task_breakdown.md
ADDED
@@ -0,0 +1,97 @@
# Person B (Ayush) Task Breakdown and Execution Plan

Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`

---

## 1. Status

Ayush's implementation lane is complete.

Completed tasks in this lane now cover:

1. Scientist prompting and parsing
2. Baseline Scientist policy
3. Shared deterministic Lab Manager grounding contributions
4. Notebook and reusable training stack
5. ART/OpenEnv rollout-to-trainer integration
6. Metrics, plotting, evaluation, trained-policy loading, and metadata export
7. Fresh-runtime notebook smoke validation

The remaining training risk is no longer missing backlog work in Ayush's lane.
It is model quality:

1. The ART/OpenEnv Scientist runtime is live and reproducible.
2. The latest live checkpoint still underperforms the deterministic baseline on held-out comparison.
3. The next useful work is experiment iteration, not infrastructure completion.

---

## 2. Final Verification State

The following validation steps are now complete:

1. `scientist-preview` smoke run
2. `lab-manager-preview` smoke run
3. live `art-scientist-train` smoke run against the hosted ReplicaLab environment
4. `scientist-compare-eval` smoke run against the trained checkpoint
5. focused training-policy tests and CLI tests

Smoke artifacts now exist under:

1. `replicalab/outputs/training/scientist-preview-smoke-20260308/`
2. `replicalab/outputs/training/lab-manager-preview-smoke-20260308/`
3. `replicalab/outputs/art-training/art-scientist-smoke-20260308/`
4. `replicalab/outputs/art-training/art-scientist-compare-smoke-20260308/`

---

## 3. Remaining External Work

No Ayush-owned backlog items remain.

Open work outside this lane that still matters to the final story:

1. `TRN 12` owned by Person D: turn evaluation outputs into judge-facing result bullets
2. UI and README result presentation tasks
3. demo-storytelling tasks

These are not blockers for the training runtime itself.

---

## 4. Next Technical Focus

If work continues in this lane, it should target model improvement rather than missing task closure:

1. Increase Scientist training coverage beyond the current smoke scenario set
2. Inspect failure episodes from `art-scientist-compare-20260308-step5` and `art-scientist-compare-smoke-20260308`
3. Add stronger warm-start or curriculum before more RL updates
4. Execute the Lab Manager SFT path live and evaluate its effect separately
5. Keep baseline-vs-trained comparisons on fixed seeds and frozen evidence packs
6. Track `paper_understanding` and `communication_quality` on every eval run
7. Keep the shared benchmark-history plots updating across runs
8. Use `docs/training_goals.md` as the near-term model-goals reference

---

## 5. Base Model Assumptions

Primary shared base: **Qwen3.5-9B**

1. Scientist uses the shared base with a GRPO-style trainable adapter.
2. Lab Manager uses the same shared base with a separate SFT adapter.
3. `Qwen3.5-4B` remains the lower-memory fallback.
4. `Qwen3.5-122B-A10B` is an audit-only judge candidate, not the reward source.
5. The deterministic rubric remains the only training reward source.

---

## 6. Summary Table

| Category | Count | Status |
|----------|-------|--------|
| Ayush-owned tasks remaining | 0 | Closed |
| Technical blockers in Ayush lane | 0 | Closed |
| Live runtime path | 1 | Validated |
| Main remaining risk | 1 | Model quality, not infrastructure |
docs/ayush/task_list.md
ADDED
@@ -0,0 +1,92 @@
# Person B (Ayush) Task List

Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`

---

## Current status

- All Ayush-owned implementation tasks are now complete.
- `TST 09` is now complete after the fresh-runtime smoke checklist was both written and exercised against the live ART/OpenEnv path.
- The active training bottleneck is no longer missing infrastructure in Ayush's lane; it is model quality.
- The current live Scientist ART checkpoint (`step6`) still underperforms the deterministic baseline on held-out comparison, so the next gains will come from better data, curriculum, reward shaping, and policy tuning rather than missing plumbing.

---

## Epic E02. Domain Models

- [x] **MOD 09** | Add output parser that maps model text to `ScientistAction` | 0.75h | Depends: MOD 01 | Status: completed on 2026-03-08

---

## Epic E03. Scenario Engine

- [x] **SCN 11** | Create hand-checked golden scenarios for prompt testing | 0.75h | Depends: SCN 09 | Status: completed on 2026-03-08

---

## Epic E04. Scientist Agent and Lab Manager Policy

- [x] **AGT 01** | Draft domain-neutral system prompt for Scientist role from normalized scenario data | 0.75h | Depends: MOD 01, SCN 11 | Status: completed on 2026-03-08
- [x] **AGT 02** | Build observation to prompt formatting helper from normalized scenario-derived observations | 0.75h | Depends: AGT 01, MOD 03 | Status: completed on 2026-03-08
- [x] **AGT 03** | Add parse plus retry strategy for malformed model output | 0.75h | Depends: MOD 09, AGT 02 | Status: completed on 2026-03-07
- [x] **AGT 04** | Build baseline heuristic Scientist for non-trained smoke tests | 1h | Depends: AGT 02 | Status: completed on 2026-03-08
- [x] **AGT 05** | Implement deterministic feasibility checker over normalized constraints and resources (shared with Person A) | 1.25h | Depends: SCN 07, MOD 05 | Status: completed on 2026-03-08
- [x] **AGT 06** | Implement alternative suggestion logic from allowed substitutions and tradeoffs | 1h | Depends: AGT 05, SCN 08 | Status: completed on 2026-03-08
- [x] **AGT 07** | Add model-backed Lab Manager response synthesis from checker output | 0.75h | Depends: AGT 05 | Status: completed on 2026-03-08
- [x] **AGT 08** | Add prompt formatting and parse tests | 0.75h | Depends: AGT 01 to AGT 04 | Status: completed on 2026-03-07
- [x] **AGT 10** | Write domain-neutral prompt text files for all three roles | 0.75h | Depends: AGT 01, AGT 07, JDG 06 | Status: completed on 2026-03-08
- [x] **AGT 11** | Select and document base model for Scientist training | 0.5h | Depends: AGT 01 | Status: completed on 2026-03-08

---

## Epic E05. Judge Engine and Reward

- [x] **JDG 10** | Expose component metrics for training plots | 0.5h | Depends: JDG 05, JDG 07 | Status: completed on 2026-03-08

---

## Epic E08. RL Training Pipeline

- [x] **TRN 01** | Create notebook skeleton | 0.5h | Depends: API 10 | Status: completed on 2026-03-08
- [x] **TRN 02** | Add package install and model setup cell | 0.75h | Depends: TRN 01 | Status: completed on 2026-03-08
- [x] **TRN 03** | Implement environment client wrapper | 1h | Depends: API 06 | Status: completed on 2026-03-08
- [x] **TRN 04** | Implement rollout collection loop | 1h | Depends: TRN 03, AGT 01 | Status: completed on 2026-03-08
- [x] **TRN 05** | Connect rollouts to GRPO or equivalent trainer | 1.25h | Depends: TRN 04 | Status: completed on 2026-03-08
- [x] **TRN 06** | Log episode reward, rigor, feasibility, fidelity, rounds | 0.75h | Depends: JDG 10, TRN 04 | Status: completed on 2026-03-08
- [x] **TRN 07** | Plot reward curve and component curves | 0.5h | Depends: TRN 06 | Status: completed on 2026-03-08
- [x] **TRN 08** | Add before-versus-after evaluation on fixed seeds | 1h | Depends: SCN 11, TRN 05 | Status: completed on 2026-03-08
- [x] **TRN 09** | Add policy loading path for trained adapter | 0.5h | Depends: TRN 05 | Status: completed on 2026-03-08
- [x] **TRN 10** | Export plot image and sample logs to outputs/plots | 0.25h | Depends: TRN 07 | Status: completed on 2026-03-08
- [x] **TRN 13** | Create reusable environment client module (client.py) | 1h | Depends: API 06 | Status: completed on 2026-03-08
- [x] **TRN 14** | Select and document base model (notebook side) | 0.5h | Depends: TRN 01 | Status: completed on 2026-03-08 | Assumption now iterated to: Qwen3.5-9B primary, Qwen3.5-4B fallback, Qwen3.5-122B-A10B audit-only judge candidate
- [x] **TRN 15** | Add agreement rate and invalid action rate aggregation | 0.5h | Depends: TRN 06, TRN 08, OBS 09 | Status: completed on 2026-03-08

---

## Epic E10. Logging and Observability

- [x] **OBS 06** | Log training run metadata | 0.5h | Depends: TRN 06 | Status: completed on 2026-03-08

---

## Epic E11. Testing

- [x] **TST 09** | Create notebook smoke test for fresh runtime | 0.5h | Depends: TRN 12 | Status: completed on 2026-03-08 after executing the smoke checklist against the live ART/OpenEnv path

---

## Shared Tasks

- [x] **FND 08** | Freeze JSON contract for actions and observations (with Person A) | 0.75h | Depends: FND 04 | Status: completed and signed off

---

## Totals

| Metric | Value |
|--------|-------|
| Total tasks | 29 |
| Completed | 29 |
| Remaining | 0 |
| Remaining estimated hours | 0h |
docs/changes.md
ADDED
@@ -0,0 +1,98 @@
| 1 |
+
# Change Log
|
| 2 |
+
|
| 3 |
+
This file records deviations from the original project plan.
|
| 4 |
+
|
| 5 |
+
Rules:
|
| 6 |
+
|
| 7 |
+
- Append new entries; do not rewrite history unless a prior entry is factually wrong.
|
| 8 |
+
- Record the contributor, the task or area, the deviation, and the reason.
|
| 9 |
+
- Update this file in the same branch or PR as the deviation whenever possible.
|
| 10 |
+
|
| 11 |
+
| Date | Contributor | Task or Area | Deviation | Reason | Impact | Follow-up |
|
| 12 |
+
| --- | --- | --- | --- | --- | --- | --- |
|
| 13 |
+
| 2026-03-07 | Person B (Ayush) | FND 01 | Executed the task even though it was assigned to Person C | The repo scaffold was missing and needed immediately to unblock foundation work | Repo structure was created and tracking docs were updated to reflect the actual executor | None |
|
| 14 |
+
| 2026-03-08 | Person B (Ayush) | FND 02 | Executed the task even though it was assigned to Person C | The Python package config was needed to verify editable installs and unblock `FND 11` | `pyproject.toml` was added, install verification was run, and tracking docs were updated | `FND 11` is now unblocked |
|
| 15 |
+
| 2026-03-07 | Person B (Ayush) | FND 10 | Executed the task even though it was assigned to Person C | The output directories were still missing after the initial scaffold and needed for backlog compliance | `replicalab/outputs/` and subdirectories were added and tracking docs were updated | None |
|
| 16 |
+
| 2026-03-08 | Person B (Ayush) | FND 04 | Executed the task even though it was assigned to Person A | The shared contract stubs were needed to unblock `FND 08` and downstream schema work | `replicalab/models.py` was created and tracking docs were updated | None |
|
| 17 |
+
| 2026-03-08 | Person B (Ayush) | FND 05 | Executed the task even though it was assigned to Person C | Ignore rules were incomplete and needed to keep generated artifacts out of git and Docker context | `.gitignore` and `.dockerignore` were updated and tracking docs were aligned | None |
|
| 18 |
+
| 2026-03-08 | Person B (Ayush) | FND 06 | Executed the task even though it was assigned to Person D | The existing README described a future state and needed to become an honest temporary stub for new contributors | `README.md` now reflects the current foundation stage and verified setup placeholder | `DOC 01` is now unblocked |
|
| 19 |
+
| 2026-03-08 | Person B (Ayush) | FND 07 | Executed the task even though it was assigned to Person C | GitHub templates and explicit repo workflow artifacts were needed to reduce coordination overhead | PR and task templates were added and the project-management rules were tightened | Future PRs and task issues should use the new templates |
|
| 20 |
+
| 2026-03-08 | Person B (Ayush) | Project management | Added governance docs and a deviation log outside the original backlog | Coordination overhead and tracking drift had become a project-management risk | New repo rules now govern future task tracking, docs updates, and deviation logging | Keep these docs in sync with future work |
|
| 21 |
+
| 2026-03-08 | Person B (Ayush) | Project management | Replaced placeholder owner-doc folders with real-name folders for Kian, Max, and Kush | The team standardized on real names for owner-facing docs before future merges | Owner docs now live under `docs/kian/`, `docs/max/`, and `docs/kush/`, and the governance docs record the mapping | Use real-name folders for future owner-doc updates |
|
| 22 |
+
| 2026-03-08 | Person B (Ayush) | PR 7 import for Max | Normalized a stale contributor PR before merge instead of merging it directly | The incoming branch would have deleted governance docs, reverted current task tracking, and overstated backend task completion | Only the validated backend subset was imported, `FND 11` was marked complete, and the stub-backed API work was recorded as partial | Real-env wiring, Docker validation, and deployment verification still remain |
|
| 23 |
+
| 2026-03-08 | Person B (Ayush) | FND 08 and FND 09 | Recorded Kian-side sign-off for the shared contract and executed `FND 09` even though it was assigned to Person A | The same contributor is currently covering both the Kian and Ayush lanes, and the OpenEnv registration layer needed to be real rather than left as a placeholder | `FND 08` is now complete, `openenv.yaml` exists, and the repo now carries the minimal OpenEnv runtime wiring needed for local validation | The real environment class in `replicalab/env/replicalab_env.py` is still a later task |
|
| 24 |
+
| 2026-03-08 | Person B (Ayush) | MOD 01 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and the strict `ScientistAction` validator was the highest-leverage unblocker for downstream parser and validation work | `ScientistAction` now enforces the frozen contract, `MOD 09` and `MOD 05` are unblocked, and focused schema tests now exist in `tests/test_models.py` | `MOD 03` is the next schema-critical Kian task |
|
| 25 |
+
| 2026-03-08 | Person B (Ayush) | MOD 02 and MOD 03 | Executed the tasks even though they were assigned to Person A | The Kian and Ayush lanes are being covered together, and the strict Lab Manager plus typed observation contracts were the fastest way to stabilize the shared schema surface before parser, state, and environment work fan out | `LabManagerAction`, `ConversationEntry`, `Protocol`, and both observation branches now enforce the frozen contract, `MOD 04` and `MOD 11` are unblocked, and the stub server path is verified against the typed models | `MOD 12`, `SCN 01`, and `MOD 05` are the next Kian-lane tasks |
|
| 26 |
+
| 2026-03-08 | Person B (Ayush) | MOD 12 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and centralizing shared defaults was the cleanest way to stop config drift before the real environment and scoring modules expand | `replicalab/config.py` now holds shared defaults for scenario selection, difficulty, round cap, budget cap, timeout values, stub reward, and API host or port defaults, and the server plus scenario builders import them instead of repeating literals | `MOD 05`, `MOD 04`, and `MOD 11` remain the next Kian-lane foundation tasks |
|
| 27 |
+
| 2026-03-08 | Person B (Ayush) | MOD 11 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and a typed step-result contract was needed before the environment, API, replay, and training paths grew around loose metadata | `RewardBreakdown`, `StepInfo`, and typed `StepResult.info` now exist, and the stub runtime explicitly constructs those reserved-key payloads while preserving debug metadata | `MOD 04` and `MOD 05` were the remaining Kian-lane foundation tasks after this |
|
| 28 |
+
| 2026-03-08 | Person B (Ayush) | MOD 04 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and state plus replay needed to use the same typed protocol and conversation models already enforced at the action and observation layers | `EpisodeState` and `EpisodeLog` now carry typed `Protocol`, `ConversationEntry`, and `RewardBreakdown` fields, the stub runtime constructs those nested models explicitly, and replay serialization is now aligned with the typed contract | `MOD 07` and `ENV 01` are now unblocked |
|
| 29 |
+
| 2026-03-08 | Person B (Ayush) | MOD 05 | Executed the task even though it was assigned to Person A | The Kian and Ayush lanes are being covered together, and structural schema validation was not enough to stop impossible or hallucinated plans from reaching the environment | `replicalab/utils/validation.py` now provides deterministic protocol validation against normalized scenario resources, substitutions, time limits, and required elements, returning structured issues instead of relying on ad hoc runtime checks | `MOD 06` and shared `AGT 05` are now unblocked |
|
| 30 |
+
| 2026-03-08 | Person B (Ayush) | SCN 01 to SCN 10 | Executed the full scenario-engine prerequisite bundle even though it was assigned to Person A and originally sequenced after `MOD 04` | `SCN 11` and `AGT 01` needed a real normalized scenario generator rather than another placeholder, and the Kian plus Ayush lanes are being covered together | The repo now has deterministic seeded scenario generation for mathematics, machine learning, and finance-trading planning, plus golden fixtures and seeded scenario tests; `SCN 11`, `AGT 01`, and the stub server scenario list are now backed by the same normalized scenario pack | `MOD 04` still needs to thread the normalized scenario pack through `EpisodeState` and replay models cleanly |
|
| 31 |
+
| 2026-03-08 | Person B (Ayush) | Architecture roadmap | Shifted the planning docs from lab-first replication toward a normalized multi-domain scenario layer with mathematics and machine learning first, finance and trading planning third, and physics or biology later | The team wants the environment to stay domain-agnostic under a stable outer contract while keeping the reward deterministic and making the Lab Manager stronger for the hackathon story | The source-of-truth backlog, README, and Kian or Ayush planning docs now assume `scenario adapter -> normalized scenario pack -> observation mapper -> stable contracts`, plus a hybrid Lab Manager with deterministic feasibility grounding | `SCN 02`, `SCN 07`, `SCN 08`, `AGT 01`, `AGT 05`, `AGT 07`, and the judge wording must now be implemented to this architecture |
|
| 32 |
+
| 2026-03-08 | Person B (Ayush) | FND 03 and FND 12 | Imported the frontend shell and Vite proxy config from Kush's branch even though both tasks are assigned to Max | The `ayush` integration branch only had the frontend scaffold, and the validated frontend from `origin/Kush` needed to exist on the integration branch for future UI and deployment work | `frontend/` now contains the full React plus Vite app, `frontend/vite.config.ts` is present with API and WebSocket proxy rules, and local validation passed with `npm --prefix frontend install` plus `npm --prefix frontend run build` | `FND 13` and `UI 01` are now unblocked; remaining UI tasks still need explicit review before being marked complete |
|
| 33 |
+
| 2026-03-08 | Person B (Ayush) | Capability scope and backlog | Expanded the MVP from pure constrained negotiation to bounded evidence-backed research planning with scoped search, code-check, and image-inspection capability, while explicitly excluding audio and unrestricted live web in training | The team decided that research applicability requires richer capabilities, but the hackathon still needs a deterministic RL story with bounded tools and reproducible rewards | The source-of-truth backlog now treats richer capabilities as an additive layer below the frozen outer contract; completed schema and agent work stays valid, while pending prompt, judge, environment, API, and training tasks now absorb bounded tool and evidence-pack support | Keep live web mostly for demo or eval validation, and keep frozen evidence packs as the default training path |
|
| 34 |
+
| 2026-03-07 | Person B (Ayush) | AGT 03 | Backlog showed "Not started" but the implementation (parse-and-retry loop with telemetry) already existed from a prior commit | The code and 7 tests were committed earlier but the tracker was never updated | Synced both `ReplicaLab_Comprehensive_Task_Division.md` and `docs/completion.md` to reflect completed status | None |
|
| 35 |
+
| 2026-03-07 | Person B (Ayush) | AGT 08 | Expanded scope from test-only to tests plus a bounded-tool policy prompt patch in `build_scientist_system_prompt()` | The acceptance criteria required testing bounded-tool policy reminders, but no tool-policy text existed in the prompt yet; user directed adding the prompt text alongside the tests | Added policy block for `search_evidence`, `run_code_check`, and `inspect_image` to the system prompt; wrote 24 new tests covering parser, prompt, formatter, baseline, and bounded-tool policy; all 111 tests pass | None |
|
| 36 |
+
| 2026-03-08 | Person B (Ayush) | ENV 01 | Executed the task even though it was assigned to Person A | The real environment class was still missing, but the server now switches to `ReplicaLabEnv` on successful import, so a working drop-in module was needed before environment and API work could safely proceed | Added `replicalab/env/replicalab_env.py` and `replicalab/env/__init__.py` as a working drop-in replacement for the former in-server stub, verified direct `reset() -> step() -> state() -> close()` behavior, and confirmed the full test suite stays green at `111 passed` | `ENV 02` and `ENV 08` are now unblocked, and the server can instantiate the real env class instead of the fallback stub |
| 2026-03-08 | Person B (Ayush) | JDG 01, JDG 02, JDG 03 | Executed three scoring tasks assigned to Person A | The judge scoring chain was the next critical-path blocker: JDG 04 (total reward formula) depends on all three, and ENV 06 (reward integration) depends on JDG 05 which depends on JDG 04 | Added `replicalab/scoring/rigor.py` (weighted structural completeness, success criteria coverage, required element coverage), `replicalab/scoring/feasibility.py` (7-dimension partial-credit scorer wrapping AGT 05 feasibility checker), `replicalab/scoring/fidelity.py` (substitution-aware hidden-reference adherence scorer), shared `replicalab/utils/text.py` (token extraction and label normalization), `replicalab/scoring/__init__.py` (exports), and `tests/test_reward.py` (18 tests covering ordering, determinism, partial credit, domain range, and cross-scorer consistency); all 134 tests pass | JDG 04 is now unblocked; tracker docs were synced separately |
| 2026-03-08 | Person B (Ayush) | ENV 02, ENV 03, ENV 04, ENV 05, ENV 06, ENV 07, ENV 08, JDG 04, JDG 05, TST 01, TST 02, TST 03 | Executed the full environment chain and rubric tasks assigned to Person A | The environment needed real scenario wiring, validation, grounded Lab Manager responses, centralized termination, judge-computed rewards, deep state snapshots, and close lifecycle guards; the rubric needed the total reward formula and breakdown builder; and the test suite needed reset, step, and invalid-action coverage | Rewrote `replicalab/env/replicalab_env.py` (ENV 02-08: scenario-pack-backed observations, protocol validation, grounded LM pipeline, accept-or-max-rounds termination, real judge scoring via rubric, deep state copies, closed-env guard), created `replicalab/scoring/rubric.py` (JDG 04-05: `compute_total_reward` with `10 × r × f × fi + bonuses − penalties`, `build_reward_breakdown` composing all three sub-scores with efficiency bonus), updated `replicalab/scoring/__init__.py` exports, and created `tests/test_env.py` (TST 01-03: 32 tests covering reset, step, invalid action, state snapshot, close/reopen, and rubric); all 166 tests pass | JDG 06, JDG 08, ENV 10, ENV 11, TST 04, TST 05 are now unblocked; partial server tasks (API 02, 03, 06, 07) can now wire against the real env |
| 2026-03-07 | Person B (Ayush) | JDG 04, JDG 05, ENV 06 finalization | Refined the draft implementations to match final acceptance criteria | JDG 04 needed a zero-clamp floor and JDG 05 needed a named-penalty extension point for bounded-tool diagnostics; ENV 06 needed to distinguish timeout from no-agreement verdicts | `compute_total_reward` now clamps at 0.0; `build_reward_breakdown` accepts optional `penalties: dict[str, float]` for named penalty keys like `invalid_tool_use` and `unsupported_claim`; terminal-without-agreement path now returns `timeout` when max rounds reached vs `no_agreement` otherwise; added 8 new tests in `test_reward.py` and 4 new tests in `test_env.py`; 178 tests pass across the full suite | None |
| 2026-03-07 | Person B (Ayush) | API 03 | Completed the `POST /step` endpoint task assigned to Person C by fixing stale replay logging and adding endpoint tests | The `_build_episode_log()` helper still hardcoded stub audit notes, rebuilt `RewardBreakdown` from state, and used `accept`/`revise` instead of the real `timeout`/`no_agreement` verdicts; both REST and WebSocket terminal paths used the stale helper; and no `/step` endpoint tests existed | Updated `_build_episode_log()` to accept the terminal `StepResult` and use its real `reward_breakdown`, `judge_notes`, and `verdict`; updated both REST `/step` and WebSocket step completion paths to pass the result; fixed `_StubEnv` reference to removed helper; added five endpoint tests covering happy path, invalid session 404, terminal real reward breakdown, semantic invalid action as 200 with `info.error`, and replay with real judge data; all 183 tests pass | API 14 and API 18 are now closer to completion; TST 06 is partially covered by the new tests |
| 2026-03-07 | Person B (Ayush) | API 06 and TST 07 | Executed the WebSocket session handler task and its test task even though both were assigned to Person C | The WebSocket handler already existed in `server/app.py` but had no test coverage, and completing `API 06` was needed to unblock `TRN 03` and `TRN 13` in Person B's own lane | Added 12 WebSocket tests covering connectivity, message handling, error paths, session isolation, semantic-vs-transport error distinction, timeout verdict with real-env integration, and terminal episode replay persistence via `GET /replay/{episode_id}`; all 195 tests pass; `TRN 03` and `TRN 13` are now unblocked for Person B | `TRN 03` and `TRN 13` are now the next Person B tasks |
| 2026-03-08 | Person B (Ayush) | API 13 | Executed the task even though it was assigned to Person C | The CORS middleware already existed in `server/app.py`, but the task was still partial because frontend-origin verification had not been made explicit | Added three server tests covering localhost Vite preflight, Hugging Face Space origin preflight, and disallowed-origin rejection; `API 13` is now recorded complete in the source of truth and owner trackers | `API 02`, `API 04`, `API 07`, `API 08`, `API 14`, and `OBS 02` remain in Max's active lane |
| 2026-03-08 | Person B (Ayush) | API 04 | Executed the task even though it was assigned to Person C | The `/scenarios` endpoint and its focused tests already met the acceptance criteria, but the task was still marked partial in the trackers | Recorded `API 04` complete in the source of truth and owner trackers based on the existing typed response model, normalized family list, and five dedicated endpoint tests | `API 07`, `API 08`, `API 14`, and `OBS 02` remain in Max's active lane |
| 2026-03-08 | Person B (Ayush) | API 02 | Completed the `POST /reset` endpoint verification and test closure even though the task was assigned to Person C | The endpoint already worked against the real env via `_make_env()` but had no dedicated test coverage and was still marked partial in the tracker | Added seven dedicated `/reset` endpoint tests covering response shape, both-role observation, explicit session_id reuse with prior-env close, default params, all scenario and difficulty combos, and seed determinism; all 202 tests pass; `API 14` and `UI 06` are now closer to completion | None |
| 2026-03-08 | Person B (Ayush) | TRN 13 | Implemented `replicalab/client.py` as specified in the task backlog | `API 06` was complete and `TRN 13` was the next unblocked Person B task | Created `ReplicaLabClient` with dual-transport support (REST via `httpx`, WebSocket via `websocket-client`), unified sync interface (`connect`, `reset`, `step`, `state`, `close`), context manager, internal session tracking, typed Pydantic returns, and 24 tests covering both transports; all 231 tests pass | `TRN 03` is now the next unblocked Person B task |
| 2026-03-08 | Person B (Ayush) | API 07 | Completed the WebSocket idle-timeout and graceful-disconnect verification even though the task was assigned to Person C | The idle-timeout logic and `finally: env.close()` path already existed in `server/app.py`, but the task was still partial because resource-cleanup verification had not been made explicit | Added two focused WebSocket tests covering idle timeout close code `1000` and exactly-once `env.close()` on disconnect; `API 07` is now recorded complete in the source of truth and owner trackers | `API 08`, `API 14`, and `OBS 02` remain in Max's active lane |
| 2026-03-08 | Person B (Ayush) | API 08 | Completed the Docker build and run verification even though the task was assigned to Person C | The Dockerfile existed but had never been verified end to end; editable install failed inside Docker, and `httpx` plus `websocket-client` were missing from `server/requirements.txt` | Fixed `pip install -e .` to `pip install .` in both `server/Dockerfile` and root `Dockerfile`; added `httpx` and `websocket-client` to `server/requirements.txt`; rebuilt without cache; verified container starts with `"env":"real"` and all four endpoints (`/health`, `/scenarios`, `/reset`, `/step`) respond correctly; added verified endpoint commands to `docs/max/deployment.md` | `API 09` and `API 16` are now unblocked |
| 2026-03-08 | Person B (Ayush) | Recovery sync, API 09, API 15, TST 04, TST 05 | Recovered the lost env, server, client, and test bundle from unreachable git objects and re-synced the deployment and testing trackers to the validated repo state | The branch had rolled back to `5538ba0`, which left the working code, deployment metadata, and tracker files out of sync even though the recovered code now passes 231 tests, Docker validation, and OpenEnv validation | Restored the missing runtime files, revalidated the real env and Docker path, recorded HF Space metadata tasks (`API 09`, `API 15`) as complete, and closed the two reward-regression tests (`TST 04`, `TST 05`) that are already covered in `tests/test_reward.py` | Live HF Space bring-up remains `API 10` |
| 2026-03-08 | Person B (Ayush) | JDG 08 | Executed the task even though it was assigned to Person A | The judge stack needed stronger regression coverage before parallel training and deployment work fan out, and the current reward tests did not yet cover the most important ordering and edge-case scenarios explicitly | Added five focused `tests/test_reward.py` regressions covering good-vs-awful ordering across all judge axes and total reward, success-criteria sensitivity for rigor, partial equipment credit for feasibility, direct-match vs substitution vs miss ordering for fidelity, and reward-breakdown determinism with and without a precomputed feasibility check; full suite now passes at 264 tests | `JDG 06`, `AGT 09`, `SCN 13`, and `ENV 10` remain the next Kian-lane tasks |
| 2026-03-08 | Person B (Ayush) | MOD 06 | Completed the semantic impossibility validators even though the task was assigned to Person A | The dependency `MOD 05` was complete and the validators extend the same `validate_protocol()` function | Added `_check_semantic_impossibilities()` with five checks (zero sample with controls, controls >= sample size, duplicate controls/equipment/reagents) and seven new tests; all 223 non-live-server tests pass; valid protocols remain unaffected | `MOD 08` (unit tests for schemas and validators) is partially unblocked |
| 2026-03-08 | Person B (Ayush) | JDG 06 | Implemented the plain-English judge explanation layer even though the task was assigned to Person A | `JDG 05` was complete, the explanation function was fully deterministic and isolated, and finishing it immediately unblocked Ayush's `AGT 10` prompt-file task | Added `replicalab/scoring/explain.py`, exported `explain_reward(...)` through `replicalab.scoring`, and covered it with nine focused reward tests without changing any scoring math | `AGT 10` is now unblocked; `JDG 11` can now package the explanation into the final audit payload |
| 2026-03-08 | Person B (Ayush) | JDG 11 | Implemented the structured final audit payload even though the task was assigned to Person A | Both dependencies (`JDG 05`, `JDG 06`) were complete, and the audit builder is a pure deterministic formatter with no scoring changes | Created `replicalab/agents/judge_policy.py` with `JudgeAudit` model and `build_judge_audit()` builder; derives verdict, reuses `explain_reward()` for notes, extracts top failure reasons from weak components and penalty keys; exported through `replicalab.agents`; ten tests pass; 255 full suite passes | `ENV 11`, `UI 13`, and `OBS 09` are now unblocked |
| 2026-03-08 | Person B (Ayush) | SCN 13 and AGT 09 | Executed two Person A tasks to keep the Kian lane consistent with the implemented repo state | `SCN 13` was already implemented in the scenario layer and `AGT 09` was already implemented as deterministic Lab Manager regression coverage, but both were still left open in the tracker flow | Recorded `SCN 13` complete in the normalized scenario layer and `AGT 09` complete in the Lab Manager grounding test stack, bringing the source-of-truth backlog, completion rollup, and Kian owner docs back into sync with code and tests | `ENV 10` and `ENV 11` are now the remaining unblocked Kian-lane tasks |
| 2026-03-08 | Person B (Ayush) | ENV 11 | Finished the env-side audit integration on Person A's lane and closed the replay-state gap | The env already attached `judge_notes` and `verdict` to terminal `StepResult` and `EpisodeState`, but replay logs were still dropping `top_failure_reasons`, so the task was only partially complete against its own acceptance text | Added `top_failure_reasons` to the replay `EpisodeLog` build path in `server/app.py`, kept the canonical env audit source in `replicalab/env/replicalab_env.py`, and verified terminal audit payload survival through env tests and replay endpoint tests | `ENV 11` is now fully closed; Kian's only fully unblocked task is `ENV 10`, while `API 18` and `OBS 09` are each one dependency closer |
| 2026-03-08 | Person B (Ayush) | ENV 10 | Executed the deterministic replay and broader environment regression suite even though the task was assigned to Person A | The environment lifecycle and audit stack were complete, but the repo still needed proof that same seed plus same action sequence yields the same trajectory and final state across all supported families without depending on file-backed replay persistence | Added replay-determinism coverage to `tests/test_env.py` for same-seed initial observations, same-seed same-action trajectories, timeout determinism, invalid-action determinism, and terminal audit replay stability across math, ML, and finance families; full suite now passes at 327 tests | `OBS 04` is now unblocked, while `MOD 08` still waits on `MOD 07` |
| 2026-03-08 | Person B (Ayush) | OBS 04 | Closed the replay-observability test task on Person A's lane using the new deterministic env replay suite | `OBS 04` depends on `ENV 10`, and the completed `TestReplayDeterminism` block already proves same-seed same-action replay consistency across the full environment stack, so leaving the task open would only create tracker drift | Recorded `OBS 04` complete against the existing `tests/test_env.py` replay determinism coverage without adding redundant second-copy tests; the observability lane now has its deterministic replay guard in the env test suite | Kian has no fully unblocked implementation task left; `MOD 08` still waits on `MOD 07` |
| 2026-03-08 | Person B (Ayush) | AGT 10 | Implemented the role prompt files and loader helpers in code after the deterministic judge explanation layer landed | `AGT 10` was unblocked by `JDG 06`, and keeping the prompt source in versioned files was cleaner than scattering role text across notebook cells or inline string literals | Added `replicalab/prompts/scientist.txt`, `lab_manager.txt`, and `judge.txt` plus rendering helpers in `replicalab/prompts/__init__.py`, with six tests covering loadability, placeholder rendering, and bounded-tool rules | The role prompt bundle is now stable for notebooks, demos, and later model calls |
| 2026-03-08 | Person B (Ayush) | TRN 04 | Implemented the rollout collection loop as a reusable Python module rather than only inside a notebook | The backlog labels `TRN 04` as notebook work, but implementing it in `replicalab/training/rollout.py` makes the same rollout logic reusable across notebooks, tests, and future trainer code while preserving the required behavior | Extended `RolloutWorker` with terminal `StepInfo`, bounded tool trace aggregation, and `collect_rollouts(...)`; added trace and batch tests in `tests/test_rollout_traces.py` and kept the rollout logic fully testable outside a notebook | `TRN 05` is now unblocked and notebooks can import the rollout loop instead of reimplementing it |
| 2026-03-08 | Person B (Ayush) | API 14 | Completed the REST session isolation verification even though the task was assigned to Person C | The session isolation logic already worked correctly in `server/app.py`; the task was still marked partial because no dedicated tests proved concurrent-user isolation against the real env | Created `tests/test_api_rest_isolation.py` with 11 tests covering session independence, round-count isolation, terminal isolation, session_id reuse, invalid session handling, and replay isolation; no server changes needed; 307 tests pass | No new dependencies unblocked; `API 14` was the last partial API task besides `API 01` and `OBS 02` |
| 2026-03-08 | Person B (Ayush) | MOD 07 and MOD 10 | Closed the replay-persistence and schema-example tasks on Max's lane after verifying the code that had already landed | `replicalab/utils/logging.py` and the API example generator were implemented and passing tests, but the source-of-truth backlog and Max's owner docs still showed both tasks as not started, and the generated examples still contained stale stub audit text | Updated `tests/fixtures/generate_api_examples.py` to derive terminal judge metadata from the current deterministic judge helpers, regenerated `api_schema_examples.json`, and synced `MOD 07`/`MOD 10` to complete in the comprehensive backlog, completion rollup, and Max owner docs | `MOD 08` and `JDG 07` are now clearly unblocked in the tracked plan |
| 2026-03-08 | Person B (Ayush) | Reward shaping and rubric refinement | Expanded the reward system beyond terminal-only scoring without reopening the outer action or observation contract | Sparse terminal-only reward was too weak for RL training, and the project needed deterministic shaping rather than a frontier-model reward source | Added a parsimony term to terminal reward, introduced deterministic step shaping in `ReplicaLabEnv` (information gain, protocol delta, momentum, contradiction, hallucination, stalling, regression, invalid-action, timeout, and no-agreement signals), updated rollout aggregation to use cumulative episode reward, and aligned env/server tests to the new shaped-reward semantics while keeping the full suite green at 356 tests | Keep the notebook and training plots explicit about terminal reward components vs cumulative shaped episode reward |
| 2026-03-08 | Person B (Ayush) | Oracle hybrid architecture | Added an Oracle-style frontier-model layer as an additive integration instead of replacing the deterministic environment and reward stack | The sponsor-facing V2 direction calls for a model-driven intelligence layer woven through scenario generation, environment interaction, and explanation, but the RL training path still needs deterministic reward and reproducible evaluation | Added `oracle_models.py`, `oracle.py`, `cache.py`, Oracle prompt assets, an optional model-backed Lab Manager wrapper, an adapter from Oracle scenarios into the existing normalized scenario pack, and feature-flagged Oracle hooks in `ReplicaLabEnv`; kept deterministic scoring in `replicalab/scoring/*` as the canonical training reward; expanded test coverage with `test_oracle.py`, `test_cache.py`, and Oracle adapter/prompt tests; full suite now passes at 365 tests | If this grows beyond the current additive mode, record any future contract or reward-source changes separately before altering the deterministic training path |
| 2026-03-08 | Person B (Ayush) | Deployment access tooling | Added Northflank CLI installation verification and service-operation commands to `docs/max/deployment.md` even though the original deployment docs were HF-Space-centric | The active service now also needs a documented Northflank access path for forwarding, logs, shell access, and file transfer | Backend deployment docs now include the verified local CLI install (`northflank` 0.10.16), login command shape, and the `replica-labs` / `replicalab-ai` service commands | Actual login still requires a user-supplied account token outside the repo |
| 2026-03-08 | Person B (Ayush) | Local paper corpus for training and experiment design | Added a new local dataset under `data/papers/` sourced from `ReplicaLab_50_Scenarios_Training_Plan.md`, which is outside the original tracked backlog artifacts | The training-plan draft now calls for a 50-paper corpus to support experiment-design grounding, but many scenario titles are synthetic summaries rather than directly retrievable publication titles | Downloaded 50 open-access PDFs into `data/papers/<field>/<paper-name>/`, added per-paper metadata plus `data/papers/manifest.json`, and marked substitute papers explicitly when the exact scenario title could not be matched cleanly | If the team wants this corpus versioned in git or refreshed later, keep using the manifest as the source of truth for replacements and provenance |
| 2026-03-08 | Person B (Ayush) | MOD 08 | Completed the comprehensive schema and validator unit test task on Person A's lane | All MOD 01–07 dependencies were complete, and the task was the last remaining item in Kian's backlog | Created `tests/test_mod08_schemas.py` with 70 unit tests covering all Pydantic model edge cases across 11 test classes (ScientistAction, LabManagerAction, Protocol, ConversationEntry, RewardBreakdown, Observation, LabManagerObservation, StepInfo, StepResult, EpisodeState, EpisodeLog); full suite passes at 409 tests | Kian's lane is now 100% complete (49/49 tasks) |
| 2026-03-08 | Person B (Ayush) | JDG 07 | Closed the reward-breakdown logging task on Max's lane after verifying the implementation already meets all acceptance criteria | `append_reward_csv()`, `append_reward_jsonl()`, and `log_episode_reward()` were already implemented in `replicalab/utils/logging.py` with 22 tests in `tests/test_logging.py`; no code changes needed | Verified CSV column set (parsimony, bonuses, penalty total, verdict), JSONL nested penalty/bounded-tool preservation, determinism, and the dual-format convenience wrapper; marked JDG 07 complete in all three tracker files | `ENV 09` and `JDG 10` are now unblocked |
| 2026-03-08 | Person B (Ayush) | API 01 and OBS 02 | Closed the two remaining partial tasks on Max's lane after verifying both already exceed their acceptance criteria | API 01's health endpoint, full REST/WS server, and 34+11 endpoint tests were already passing; OBS 02's env-var log toggle and readable format were already wired in `config.py` and `server/app.py` | Verified and marked both tasks complete; no active partial tasks remain in the project | Max's next unblocked chain is `ENV 09 -> OBS 01 -> API 05` |
| 2026-03-08 | Person B (Ayush) | V2 training architecture | Implemented the training stack as reusable Python modules plus Northflank-friendly job entrypoints instead of keeping the work notebook-only | The active runtime direction changed to Northflank H100 with persistent volumes, two first-class model artifacts, and a judged notebook that should stay thin and readable | Added `replicalab/training/{artifacts,corpus,datasets,runtime,scientist_grpo,lab_manager_sft,evaluation,metrics,plots,cli}.py`, added `replicalab-train` as a package script, created `notebooks/train_colab.ipynb` as the driver notebook, and added focused training tests | Remaining work is real-run validation (`TRN 05`), notebook-facing metric finalization (`JDG 10`, `TRN 06`), and trained-adapter evaluation wiring (`TRN 08`, `TRN 09`, `TRN 15`) |
| 2026-03-08 | Person B (Ayush) | ENV 09, OBS 01, OBS 03, OBS 07, OBS 09, API 05, API 11, API 18, TST 06, TST 11 | Executed ten Person C (Max) tasks as a batch to close out the logging, replay, observability, API endpoint, and testing gaps | Max's remaining backend chain was blocking downstream UI, notebook, and submission tasks, and Person B had already implemented most of the underlying code in prior commits | ENV 09: added `write_episode_log()` and `log_episode_reward()` calls to REST and WS step handlers for auto-persisting replay JSON and reward CSV/JSONL. OBS 09: added `invalid_action_count` and `invalid_action_rate` fields to `EpisodeLog`. OBS 07: created `scripts/run_episode.py` for one-command local episode dumps. TST 11: created `tests/test_audit_contract.py` with 17 contract tests. API 05, API 11, API 18, OBS 01, OBS 03, TST 06: verified already-implemented code against acceptance criteria and recorded as complete | Max's remaining tasks are `API 16`, `API 19`, `DOC 08`, and `UI 11` |
| 2026-03-08 | Person B (Ayush) | API 19 | Implemented the OpenEnv `/web` fallback route on Person C's lane | All dependencies (`FND 09`, `API 08`, `API 10`) were complete; the fallback was needed for demo resilience when the custom React UI is unavailable | Added a self-contained HTML/JS `/web` endpoint to `server/app.py` with interactive reset/propose/accept controls, scenario/seed/difficulty selection, negotiation log, score display, and raw response viewer; added `web_fallback: /web` to `openenv.yaml`; added 3 tests in `test_server.py`; 474 tests pass | Max's remaining tasks are `API 16`, `DOC 08`, and `UI 11` (all blocked on Kush frontend work) |
| 2026-03-08 | Person D (Kush) | UI 07 | Completed the REST plus WebSocket client helpers task | Kush pushed a full `frontend/src/lib/api.ts` rewrite with REST helpers (`healthCheck`, `resetEpisode`, `stepEpisode`, `getReplay`), WebSocket support (`createWebSocket`, `sendWsMessage`), backend-to-frontend type adapters, and default action builders | `UI 07` is now complete; `UI 11` is unblocked on this dependency | `UI 11` can now proceed once the integration is wired |
| 2026-03-08 | Person D (Kush) | API 16, UI 10, UI 11 | Completed frontend integration, styling, and Docker multi-stage build | Kush pushed multi-stage Dockerfile (Node frontend build into Python runtime), SPA static serving in `server/app.py`, and new frontend components (ProtocolEditor, AutoPlayControls, LiveScoreGauges, LabScene3D, AgentThoughts, EpisodeComparison, Onboarding, KeyboardShortcuts, Toast, confetti) | All three tasks complete; Max's lane reduced to `DOC 08` only | `DOC 08` was the last Max task |
| 2026-03-08 | Person B (Ayush) | DOC 08 | Verified repo hygiene on Person C's lane | All dependencies (`API 10`, `UI 10`, `TRN 10`) were now complete | Verified repo is public (`isPrivate: false`), `.env` is not tracked, no API key patterns in tracked files, `.gitignore` covers `.env`, and all required files exist (code, models, env, scoring, agents, server, frontend, Docker, tests, notebook, scripts, docs) | Max (Person C) is now 100% complete (41/41 tasks) |
| 2026-03-08 | Person B (Ayush) | ART/OpenEnv training runtime | Switched the active live RL execution path from the planned Northflank-heavy route to the already-working ART/OpenEnv serverless route for immediate training validation | The Northflank H100 job shape was documented and scaffolded, but the fastest path to real rollouts and trainer execution was the hosted ReplicaLab + OpenPipe ART integration that could be exercised immediately | Added `art-scientist-train`, live smoke runs, comparison-eval runs, run metadata, plots, evidence manifests, and process documentation; the training pipeline is now validated end to end against the live environment | Keep Northflank as the future heavy-run backend once the dedicated GPU job image and volume flow are ready |
| 2026-03-08 | Person B (Ayush) | TST 09 | Marked the notebook smoke-test task complete before `TRN 12` because the checklist and runtime validation are technical work, while `TRN 12` is a storytelling task | The smoke checklist was already written, and it was then executed end to end with fresh-runtime preview, live ART/OpenEnv training, and comparison-eval commands against frozen evidence packs | `TST 09` is now complete; Ayush's lane is fully closed, while Person D still owns the plain-English result bullets in `TRN 12` | Continue using the smoke checklist as the canonical fresh-runtime validation path for the judged notebook |
| 2026-03-08 | Person B (Ayush) | Frozen evidence-pack loading | Added a plan-derived fallback when the local `data/papers/manifest.json` corpus is absent | The paper corpus is intentionally not committed, but fresh-runtime training preview and test paths still need stable evidence packs instead of crashing on a missing manifest file | `replicalab/training/corpus.py` now synthesizes deterministic `plan_only` manifest entries from the 50-scenario training plan whenever the local paper manifest is missing; fresh-runtime preview, tests, and smoke commands now work without the local PDF corpus | Keep using the real local corpus when available; treat the plan-only path as a portability fallback, not the preferred evaluation corpus |
| 2026-03-08 | Person B (Ayush) | Minimal Colab sponsor asset | Added an explicit minimal Colab training notebook in addition to the fuller judged notebook | The hackathon requirement calls for a minimal Unsloth or HF TRL Colab script, and the repo previously only had the broader multi-step notebook plus a placeholder minimal file | `notebooks/train_minimal_colab.ipynb` now contains a real minimal Unsloth + HF TRL GRPO flow for ReplicaLab, and `tests/test_notebooks.py` guards that both notebook assets keep their intended roles | Keep the minimal notebook tiny and sponsor-facing; keep complex workflow details in `notebooks/train_colab.ipynb` |
| 2026-03-08 | Person B (Ayush) | Person D batch close-out: DOC 01-07, DOC 09-11, SCN 12, TRN 12, API 12, UI 01-06, UI 08-09, UI 12-15, FND 13, JDG 09, OBS 05, OBS 08, TST 08, TST 10, TST 12 | Closed 28 remaining Person D tasks in one batch to reach 152/152 (100%) | Kush had already built the full React frontend (14 of 15 UI tasks), and the doc/storytelling tasks were text work that could be completed from the existing README, demo script, recording guide, and smoke checklist | Enhanced README with replication-crisis hook (DOC 01), 4-option setup (DOC 03), key takeaways (DOC 04/TRN 12), /web fallback route (DOC 11), aligned scenario summaries (SCN 12). Created docs/submission_prep.md (DOC 09) and docs/pitch_outline.md (DOC 10). Verified Kush's frontend components against acceptance criteria for all UI tasks. Marked existing docs (demo_script.md, recording_guide.md, ui_smoke_checklist.md) against DOC 05-07, UI 12, TST 08, TST 12 | Project is now 100% complete across all 12 epics and 4 team members |
| 2026-03-08 | Person B (Ayush) | Frontend demo narrative refinement | Executed an additional frontend storytelling pass on Person D's lane after the backlog was already marked complete | The hackathon demo needs the UI to tell the paper-to-training story immediately, and the imported frontend still read as a generic episode runner in several places | Reframed the dashboard, episode page, paper panel, controls, training panel, and compare page around `source paper -> parsed brief -> negotiation -> deterministic judge -> training`, fixed strict TypeScript issues in imported UI components, refreshed `frontend/package-lock.json`, and verified the production build with `npm --prefix frontend run build` | Swap the packaged training-demo trace for live artifact data if a final run is ready before recording |
| 2026-03-08 | Person B (Ayush) | Frontend live episode policy | Adjusted the frontend auto-step action builder after local stack verification exposed a mismatch with backend baseline behavior | The demo UI was using a hard-coded generic proposal that could fail validation immediately on real scenarios, even though the backend and evaluator produced valid baseline runs | Made the frontend default Scientist proposal scenario-aware using live episode context (time limit, available resources, scenario family), rebuilt the frontend, and re-verified that a local ML episode now reaches a valid judged terminal result | If final recording depends on exact baseline numbers, keep using the local evaluation artifacts or wire the UI directly to saved summaries rather than relying on synthetic cards |
| 2026-03-08 | Person B (Ayush) | Frontend episode kickoff UX | Added explicit first-round call-to-action controls to the live episode view after user testing showed the page looked stuck immediately after reset | The reset state loaded the paper and constraints correctly, but both the step controls and protocol-editor submit action could sit below the fold, making the UI appear frozen at round 0 | Added an `Episode ready` banner plus in-panel `Advance First Round` and `Open Protocol Editor` actions, and updated the negotiation placeholder so it no longer says `Start an episode` after an episode is already active | Keep a visible first action near the negotiation panel in future layout changes |
| 2026-03-08 | Person B (Ayush) | One-click live demo automation | Extended the hero CTA flow so `Replicate a Paper` runs the seeded episode automatically instead of only opening the episode page | The hackathon demo needs a one-click narrative with live agent behavior and judge output; requiring manual reset and step clicks after entering the episode page weakens the demo | The dashboard now links to a seeded demo URL, the episode page auto-starts on demo routes, preserves demo query params in the shareable URL, and enables autoplay so the negotiation proceeds to judged completion with no extra clicks | If this is later generalized beyond the hero CTA, keep manual scenario-card entry as a separate non-demo path |
| 2026-03-08 | Person B (Ayush) | Frontend backend-availability diagnostics | Added a startup health check and contextual network-error messaging to the replication setup flow after the live demo surfaced a generic `Failed to fetch` banner | The page gave no actionable explanation when the local API server on port `7860` was not running, which made a recoverable environment issue look like an application bug | `Controls.tsx` now checks backend health on load, `frontend/src/lib/api.ts` rewrites fetch-network failures into an explicit uvicorn startup instruction, the frontend was rebuilt, and the local FastAPI server was restarted and verified healthy on `127.0.0.1:7860` | Keep the backend running during demo prep or use the integrated backend-served frontend on `http://127.0.0.1:7860` |
| 2026-03-08 | Person B (Ayush) | Frontend live demo outcomes and results report | Extended the one-click demo into three seeded story modes with a detailed post-episode report instead of a single autoplay path that stopped at the generic episode view | The hackathon demo needs to show distinct judged outcomes: immediate agreement, multi-round learning opportunity, and failure to reach agreement, all backed by real episode values rather than static mock copy | Added `fast-agreement`, `learning-opportunity`, and `no-agreement` dashboard launches; routed episode autoplay through scripted but backend-valid action sequences; added a results report with live reward charts, terminal score bars, training interpretation, reliability labeling, and tool-install suggestions; rebuilt the frontend and verified all three seeded ML runs against the live backend | If Oracle-backed narrative summaries are added later, keep the deterministic judge verdict and real score traces as the source of truth for the report |
| 2026-03-08 | Person B (Ayush) | Frontend training-status reporting | Replaced the packaged training teaser with an artifact-backed training page and honest improvement guidance | The demo needs a place to show training logs, real achieved values, and whether more training is still required; the earlier dashboard card implied progress but did not expose the real run outputs | Added a dedicated `/training` route, header navigation, a shared frontend data module sourced from real run summaries, and a new training page with checkpoint charts, compare bars, log highlights, preview-artifact status, and explicit `needs more training` analysis; rebuilt the frontend and reverified backend serving on `/training` | If future runs improve beyond baseline, update the shared training artifact data module first so the dashboard and training page stay consistent |
| 2026-03-08 | Person B (Ayush) | Demo video generation | Added a reproducible local builder for the one-minute demo video instead of relying on manual screen recording only | The demo now needs a fast, repeatable way to regenerate the final video with current UI states, a fresh voiceover, and ffmpeg assembly whenever the frontend story changes | Added `scripts/build_demo_video.py` plus `docs/demo_video_script_60s.md`; the script reads the ElevenLabs key from `.env`, captures the real dashboard, episode, and training screens with Selenium, synthesizes the voiceover, writes subtitles, and builds `replicalab/outputs/demo_video/replicalab_demo_60s.mp4` | If the narration or demo scenes change, rerun `python scripts/build_demo_video.py` to regenerate the assets from the current app state |
| 2026-03-08 | Person B (Ayush) | Hugging Face Space deployment | Redeployed the live HF Space from the current local app state after the hosted URL was serving an old backend-only container | The Space repo and runtime SHAs had drifted behind the local `master` branch, so the public URL showed the API landing page instead of the React app even though the repo already contained the multi-stage Docker build and SPA-serving server code | Synced the deployment files to `ayushozha/replicalab` through the Hugging Face API, restarted the Space, and verified that `https://ayushozha-replicalab.hf.space/` now serves the built frontend while `/health` still reports the real environment | If the Space serves the API-only page again, compare the Space repo SHA and runtime SHA first before assuming the frontend build is broken |
| 2026-03-08 | Person B (Ayush) | Frontend policy-results clarification | Added a separate baseline-vs-trained-vs-oracle page and clarified that the current public compare bench is still running the deterministic live runtime | The existing `/compare` page looked like a model-policy comparison, but it actually replays seeded benchmark episodes with the default Scientist action builder plus deterministic backend logic, which was confusing for the demo narrative | Added `/policies` as a dedicated policy-results page with live/runtime status, baseline vs trained artifact values, and an explicit oracle-not-mounted status; updated the header navigation and added a runtime clarification callout plus deep link on `/compare` | Keep `/compare` focused on seeded scenario benchmarking and use `/policies` when the audience asks whether the current app is actually running a trained or oracle-backed model |
| 2026-03-08 | Person B (Ayush) | Localhost model-driven Scientist runtime | Added a backend-selected Scientist runtime path for localhost episodes and switched the live local mode from the blocked Anthropic path to Ollama | The repo needed a real localhost model-driven flow rather than the frontend default action builder, but the current Anthropic account cannot make live API calls because its credit balance is exhausted | Added `/runtime` and `/agent-step`, wired Anthropic and Ollama Scientist backends, made non-demo episode stepping prefer the backend model path, added a deterministic safety adapter plus baseline fallback for fragile local generations, and verified live localhost stepping with `glm-5:cloud` through Ollama | If Anthropic credits are replenished later, restart the backend with `REPLICALAB_SCIENTIST_RUNTIME=anthropic` to use that path instead |
| 2026-03-08 | Person B (Ayush) | Frontend live-run randomness and judge semantics | Changed the default dashboard live run from one fixed scripted scenario to a random seeded paper episode, and split accepted-with-weaknesses presentation from outright failure presentation | The main demo CTA kept launching the same fixed `fast-agreement` route, which made the product feel canned, and the judge UI was showing `Accept` alongside `Failure Reasons`, which looked contradictory even though the backend semantics were agreement-based | The hero CTA now generates a fresh live route per click, the fixed outcome cards are explicitly labeled scripted, and accepted verdicts with residual gaps now render as `Accept with caveats` / `Conditional` instead of green accept plus red failure messaging | If the team later changes backend verdict semantics, keep the UI wording aligned so agreement and replicability remain separate concepts |
| 2026-03-08 | Person B (Ayush) | Frontend caveat-state consistency | Tightened the remaining frontend success states so caveated accepts no longer behave like clean wins | After the judge-panel wording fix, the stage animation, first-round “good paper” label, and completion toast could still celebrate an accepted-but-weak protocol as a full success | `CharacterStage`, `EpisodePage`, and `EpisodeResultsReport` now treat accepted-with-caveats runs as partial outcomes, and live reset checks confirmed the dynamic route surfaces distinct paper briefs across scenario families when using the real reset contract | Keep any future verdict-label changes aligned across audit copy, stage emotion, toasts, and post-episode summaries |
| 2026-03-09 | Person B (Ayush) | Post-MVP training refinement | Shifted the active training iteration from the older `Qwen3-8B` assumption to `Qwen3.5-9B`, added prompt-goal expansion plus paper-understanding and communication metrics, and started persisting cross-run benchmark history plots | Model quality is now the bottleneck, so the next useful work is better training coverage and evaluation signal rather than more plumbing; the user also requested a clearer separation between immediate metric work and a later execution-environment redesign | Scientist and Lab Manager defaults now target `Qwen/Qwen3.5-9B`, eval outputs now track `paper_understanding` and `communication_quality`, shared benchmark history now accumulates under `replicalab/outputs/training/history/`, and `docs/training_goals.md` records the larger execution-env phase as a separate architecture track | Keep the deterministic judge as the reward source; treat any large-model judge such as `Qwen3.5-122B-A10B` as audit-only until an explicit architecture change is approved |
| 2026-03-09 | Person B (Ayush) | Deployment reality check for HF + Northflank | Recorded the current hosted-model and training-launch blockers after verifying the live tokens and remote resources instead of assuming the documented path was still operational | The project docs described HF-heavy hosting and Northflank H100 training as available paths, but the current HF account is not billable and the current Northflank training job is not runnable yet | Verified via live checks that the HF token authenticates but the account reports `canPay=false` with no orgs, that `replicalab-train` returns `409 No deployment configured` when started, and that the live `replicalab-ai` container on `nf-gpu-hack-16-64` does not expose `nvidia-smi` or `/dev/nvidia*` | Before promising heavy-model hosting or H100 training, attach a runnable image to the job, re-probe GPU visibility from inside the runtime, and enable a billing-backed HF account or move serving to another provider |
| 2026-03-09 | Person B (Ayush) | Northflank notebook validation | Validated the separate Northflank notebook service after the original pasted notebook hostname turned out to be stale | The repo previously had an unrunnable training job but the team also had a live Jupyter route; without checking the actual service, it was unclear whether H100 access existed, whether the notebook credentials worked, and whether the saved training state was usable | Verified the live `notebook-openport/jupyter-pytorch` service, confirmed successful Jupyter login, confirmed in-container `NVIDIA H100 80GB HBM3`, identified the live notebook DNS `app--jupyter-pytorch--9y6g97v7czb9.code.run`, and inspected the saved GRPO outputs/logs showing checkpoints through step 200 followed by a chat-template/content-format failure | Use the notebook as the current heavy-run path only after reconciling its repo state with the main workspace and fixing the `apply_chat_template` message-format bug |
| 2026-03-09 | Person B (Ayush) | H100 paper-understanding benchmark | Shifted the active H100 benchmark from a planned full multi-round rollout sweep to a first-step live environment benchmark on the same notebook | The current notebook image lacks the fast linear-attention path for the saved `unsloth/Qwen3.5-0.8B` adapter, so repeated sharded `scientist-local-compare-eval` attempts stayed active for a long time without producing same-turn artifacts even after retry and token-budget cuts | Produced a merged live H100 benchmark artifact set at `replicalab/outputs/training/h100-one-step-500-20260309/` covering `500` total simulations (`250` shared reset cases × baseline/trained first-step actions); the current saved adapter underperformed badly versus the deterministic baseline on first-step paper understanding and collapsed to `request_info` on every trained sample | If a full multi-round benchmark is still required later, first fix the notebook image to restore the fast attention path or move the eval to a more efficient runtime |
docs/completion.md
ADDED
| 1 |
+
# ReplicaLab Task Completion Tracker
|
| 2 |
+
|
| 3 |
+
Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## Working Governance Files
|
| 8 |
+
|
| 9 |
+
| File | Role |
|
| 10 |
+
|------|------|
|
| 11 |
+
| `AGENTS.md` | Required startup and close-out rules for contributors and automated model agents |
|
| 12 |
+
| `docs/project_management_rules.md` | Detailed project-management workflow |
|
| 13 |
+
| `docs/changes.md` | Append-only deviation log |
|
| 14 |
+
| `docs/<owner>/` | Owner-local task and planning docs |
|
| 15 |
+
|
| 16 |
+
---
|
| 17 |
+
|
| 18 |
+
## Overall Completion
|
| 19 |
+
|
| 20 |
+
| Metric | Value |
|
| 21 |
+
|--------|-------|
|
| 22 |
+
| Total tasks | 152 |
|
| 23 |
+
| Completed | 152 |
|
| 24 |
+
| Partial / active | 0 |
|
| 25 |
+
| Remaining | 0 |
|
| 26 |
+
| **Completion rate** | **100.00%** |
|
| 27 |
+
|
| 28 |
+
Post-MVP benchmark note:
|
| 29 |
+
|
| 30 |
+
- On 2026-03-09, a live Northflank H100 first-step benchmark was added as an
|
| 31 |
+
operational post-MVP artifact under
|
| 32 |
+
`replicalab/outputs/training/h100-one-step-500-20260309/`.
|
| 33 |
+
- It covers `500` total simulations (`250` shared reset cases × baseline and
|
| 34 |
+
trained first-step actions) and records paper-understanding regression data
|
| 35 |
+
for the current saved Scientist adapter.
|
| 36 |
+
|
| 37 |
+
### Completion by Person
|
| 38 |
+
|
| 39 |
+
| Person | Assigned | Completed (own) | Completed (by others) | Remaining | Rate |
|
| 40 |
+
|--------|----------|----------------|----------------------|-----------|------|
|
| 41 |
+
| Kian (Person A) | 49 (47 solo + 2 shared with B) | 1 shared sign-off (`FND 08`) | 48 (`FND 04`, `FND 09`, `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`, `MOD 06`, `MOD 08`, `MOD 11`, `MOD 12`, `SCN 01` to `SCN 10`, `SCN 13`, `AGT 05`, `AGT 09`, `ENV 01` to `ENV 08`, `ENV 10`, `ENV 11`, `JDG 01` to `JDG 06`, `JDG 08`, `JDG 11`, `OBS 04`, `TST 01` to `TST 05` done by Person B) | 0 | 100.00% |
|
| 42 |
+
| Person B (Ayush) | 29 (27 solo + 2 shared with A) | 29 (`FND 08`, `MOD 09`, `SCN 11`, `AGT 01`, `AGT 02`, `AGT 03`, `AGT 04`, `AGT 05`, `AGT 06`, `AGT 07`, `AGT 08`, `AGT 10`, `AGT 11`, `JDG 10`, `TRN 01` to `TRN 10`, `TRN 13`, `TRN 14`, `TRN 15`, `OBS 06`, `TST 09`) | 0 | 0 | 100.00% |
|
| 43 |
+
| Max (Person C) | 41 | 1 (`FND 11`) | 40 (done by Person B or Person D; `API 16`, `UI 11` by Kush) | 0 | 100.00% |
|
| 44 |
+
| Kush (Person D) | 32 | 17 (`FND 13`, `UI 01`-`UI 06`, `UI 07`-`UI 09`, `UI 10`, `UI 11`, `UI 13`-`UI 15`, `JDG 09`, `OBS 05`) | 15 (by Person B: `FND 06`, `SCN 12`, `API 12`, `TRN 12`, `UI 12`, `OBS 08`, `TST 08`, `TST 12`, `DOC 01`-`DOC 07`, `DOC 09`, `DOC 11`) | 0 | **100%** |
|
| 45 |
+
| All (shared) | 3 | 3 (`FND 08`, `AGT 05`, `TST 10`) | 0 | 0 | 100.00% |
|
| 46 |
+
|
| 47 |
+
**All 152 tasks are now complete (100%).** Every person's lane is closed:
|
| 48 |
+
- Kian (Person A): 49/49 (done by Person B)
|
| 49 |
+
- Ayush (Person B): 29/29
|
| 50 |
+
- Max (Person C): 41/41 (done by Person B and Kush)
|
| 51 |
+
- Kush (Person D): 32/32 (17 by Kush, 15 by Person B)
|
| 52 |
+
- Shared: 3/3
|
| 53 |
+
|
| 54 |
+
---
|
| 55 |
+
|
| 56 |
+
## Active Partial Tasks
|
| 57 |
+
|
| 58 |
+
| ID | Assigned To | Current Status | Remaining Acceptance Item |
|
| 59 |
+
|----|-------------|----------------|---------------------------|
|
| 60 |
+
| — | — | No active partial tasks | — |
|
| 61 |
+
|
| 62 |
+
---
|
| 63 |
+
|
| 64 |
+
## Completed Tasks
|
| 65 |
+
|
| 66 |
+
### Person B (Ayush) - Completed on behalf of others
|
| 67 |
+
|
| 68 |
+
| ID | Epic | Assigned To | Task | File/Module | Date | What Was Done | Acceptance Criteria | Verified |
|
| 69 |
+
|----|------|------------|------|-------------|------|---------------|--------------------|---------|
|
| 70 |
+
| FND 01 | E01 | Person C | Create repo structure and base folders from agreed layout | repo root | 2026-03-07 | Created the full repo scaffold: `replicalab/` with subdirectories for `agents/`, `env/`, `prompts/`, `scenarios/`, `scoring/`, `utils/`; `server/`; `frontend/` with `src/components/` and `src/pages/`; `notebooks/`; `tests/`. All directories tracked via `.gitkeep` files. | All top level folders exist and repo clones cleanly | Yes |
|
| 71 |
+
| FND 02 | E01 | Person C | Add Python project config and dependencies placeholder | `pyproject.toml` | 2026-03-08 | Added a PEP 621 `pyproject.toml` with package metadata, Python 3.10+ requirement, runtime dependencies (`pydantic`, `fastapi`, `uvicorn`, `websockets`), dev extras (`pytest`, `pytest-cov`, `ruff`, `mypy`), package discovery, and pytest test-path settings. | Project installs locally without missing package errors for base modules | Yes - verified with `python -m pip install -e .`, `python -m pip install -e ".[dev]"`, and `python -c "from replicalab.models import ..."` |
|
| 72 |
+
| FND 04 | E01 | Person A | Add empty Pydantic models and shared type names | `replicalab/models.py` | 2026-03-08 | Created `replicalab/__init__.py` and `replicalab/models.py` with the shared action, observation, step, state, and log stubs. | Import paths resolve for all placeholder models | Yes - verified with `python -c "from replicalab.models import ..."` |
|
| 73 |
+
| FND 05 | E01 | Person C | Add ignore rules for Python, Node, logs, notebooks, and build artifacts | `.gitignore`, `.dockerignore` | 2026-03-08 | Added `.dockerignore` and expanded `.gitignore` for caches, coverage artifacts, notebook checkpoints, frontend build files, and generated outputs while preserving tracked `.gitkeep` files. | Repo status stays clean after local run and build, and Docker build excludes non-runtime files | Yes |
|
| 74 |
+
| FND 06 | E01 | Person D | Add temporary project stub with title, mission, team roles, and local setup placeholder | `README.md` | 2026-03-08 | Replaced the aspirational README with a temporary foundation stub that reflects the current repo state, mission, ownership, and verified setup placeholder. | New contributor can understand repo purpose in under two minutes | Yes |
|
| 75 |
+
| FND 07 | E01 | Person C | Define branch naming, PR template, and issue template | `.github/` and repo workflow docs | 2026-03-08 | Added `.github/pull_request_template.md` and `.github/ISSUE_TEMPLATE/task.yml`, and documented preferred branch naming patterns plus required tracking-doc updates in `docs/project_management_rules.md`. | All future PRs auto show the template and issue fields | Yes |
|
| 76 |
+
| FND 09 | E01 | Person A | Create OpenEnv configuration file specifying environment class, action and observation types, and server settings | `openenv.yaml`, `pyproject.toml`, `server/app.py`, `uv.lock` | 2026-03-08 | Added `openenv.yaml`, recorded the environment and contract metadata for OpenEnv, added `openenv-core` plus a `server` script entry point to `pyproject.toml`, added `main()` to `server/app.py`, and generated `uv.lock` so the repo passes local OpenEnv validation. | OpenEnv can discover and serve the environment using this config file | Yes - verified with `uv lock` and `openenv validate` |
|
| 77 |
+
| FND 10 | E01 | Person C | Create output directory structure | `replicalab/outputs/` | 2026-03-07 | Created `replicalab/outputs/` with three subdirectories: `logs/`, `replays/`, and `plots/`, all tracked via `.gitkeep` files. | Output directories exist and generated files are not committed to git | Yes |
|
| 78 |
+
| MOD 01 | E02 | Person A | Implement `ScientistAction` schema | `replicalab/models.py`, `tests/test_models.py`, `server/app.py` | 2026-03-08 | Replaced the `ScientistAction` stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, rejected mixed-mode payloads, added conditional validation for proposal, revision, request-info, and accept modes, and patched the stub server so `accept` preserves the current protocol. | Valid scientist actions parse and invalid fields raise validation errors | Yes - verified with `python -m pytest tests/test_models.py` and a stub-env `ScientistAction.model_validate(...)` smoke step |
|
| 79 |
+
| MOD 02 | E02 | Person A | Implement `LabManagerAction` schema | `replicalab/models.py`, `tests/test_models.py` | 2026-03-08 | Replaced the `LabManagerAction` stub with a strict enum-backed schema, required all frozen-contract fields, forbade unknown keys, enforced feasible-flag consistency, rejected suggestion fields outside `suggest_alternative`, and added focused validation tests. | Valid lab manager actions parse and invalid fields raise validation errors | Yes - verified with `python -m pytest tests/test_models.py` |
|
| 80 |
+
| MOD 03 | E02 | Person A | Implement role specific observation models | `replicalab/models.py`, `tests/test_models.py`, `server/app.py` | 2026-03-08 | Added typed `ConversationEntry` and `Protocol` models, upgraded both observation branches to use typed nested structures with non-negative numeric constraints and stable keys, and verified dict-to-model coercion through the stub server. | Scientist and lab observations serialize to JSON with stable keys | Yes - verified with `python -m pytest tests/test_models.py` and a stub `reset()` / `step()` JSON smoke test |
|
| 81 |
+
| MOD 04 | E02 | Person A | Implement `EpisodeState` and `EpisodeLog` models | `replicalab/models.py`, `server/app.py`, `tests/test_models.py` | 2026-03-08 | Replaced the remaining loose `dict` state and replay fields with typed `Protocol`, `ConversationEntry`, and `RewardBreakdown` models, updated the stub runtime to construct those nested models explicitly, and added round-trip coverage for serialized state and logs. | Full state round trip serialize plus deserialize works | Yes - verified with `python -m pytest tests/test_models.py` |
|
| 82 |
+
| MOD 05 | E02 | Person A | Add protocol validation for sample size, controls, duration, equipment vocab, and reagent vocab | `replicalab/utils/validation.py`, `tests/test_models.py`, `tests/test_scenarios.py` | 2026-03-08 | Added deterministic semantic protocol validation with `ValidationResult` and `validate_protocol(...)` checks for resource vocabulary, allowed substitutions, duration limits, required-element coverage, and obvious impossibilities against the normalized scenario pack. | Invalid protocol examples are rejected with readable reasons | Yes - verified with `python -m pytest tests/test_models.py tests/test_scenarios.py` |
|
| 83 |
+
| MOD 06 | E02 | Person A | Add semantic validators for impossible plans such as zero sample size with positive controls | `replicalab/utils/validation.py`, `tests/test_validation.py` | 2026-03-08 | Added `_check_semantic_impossibilities()` with five checks: zero sample with controls (error), controls >= sample size (error), duplicate controls (warning), duplicate equipment (warning), duplicate reagents (warning). Seven new tests cover all cases plus a regression guard confirming valid protocols still pass. | Semantic validator catches at least five invalid edge cases | Yes - verified with `python -m pytest tests/test_validation.py` (20 tests pass) and full suite (223 passed) |
|
| 84 |
+
| MOD 07 | E02 | Person C | Add state serialization helper for replay logs | `replicalab/utils/logging.py`, `tests/test_logging.py` | 2026-03-08 | Added file-based replay persistence helpers with atomic JSON writes (`write_episode_log`, `load_episode_log`) plus CSV reward logging (`append_reward_csv`). Eleven tests cover lossless round-trip, filename behavior, nested directory creation, transcript and reward-breakdown preservation, CSV headers, append semantics, missing-file errors, and default output targets. | State logs can be written and loaded without loss | Yes - verified with `python -m pytest tests/test_logging.py` (11 tests pass) |
|
| 85 |
+
| MOD 10 | E02 | Person C | Publish schema examples for frontend and notebook clients | `tests/fixtures/generate_api_examples.py`, `tests/fixtures/api_schema_examples.json` | 2026-03-08 | Added a deterministic generator that builds canonical REST and WebSocket example payloads from real Pydantic models and seeded scenario data, then writes a shared `api_schema_examples.json` fixture for frontend and notebook consumers. The generated examples now use the current deterministic judge metadata instead of stale stub text. | Frontend and notebook can mock against shared sample payloads | Yes - verified with `python tests/fixtures/generate_api_examples.py` and fixture review |
|
| 86 |
+
| MOD 11 | E02 | Person A | Implement `StepResult` model | `replicalab/models.py`, `server/app.py`, `tests/test_models.py` | 2026-03-08 | Added typed `RewardBreakdown` and `StepInfo` models, upgraded `StepResult.info` to the reserved-key contract while still allowing debug metadata, and updated the stub runtime to build typed reward and step-info payloads explicitly. | Step result serializes cleanly and all consumers agree on its shape | Yes - verified with `python -m pytest tests/test_models.py` |
|
| 87 |
+
| MOD 12 | E02 | Person A | Create environment configuration module with shared constants | `replicalab/config.py`, `server/app.py`, `replicalab/scenarios/*.py`, `tests/test_config.py` | 2026-03-08 | Added a shared configuration module for default scenario and difficulty, round cap, budget cap, timeout values, stub reward, and API host or port defaults, then updated the server and scenario builders to import those constants instead of repeating literals. | All modules import config from one place and no magic numbers remain in env or scoring code | Yes - verified with `python -m pytest tests/test_config.py tests/test_scenarios.py` |
|
| 88 |
+
| SCN 01 | E03 | Person A | Implement deterministic RNG helper `seed_rng()` | `replicalab/utils/seed.py`, `replicalab/scenarios/templates.py` | 2026-03-08 | Added deterministic seed helpers that derive reproducible RNG namespaces for scenario generation. | Same seed always yields the same random choices and the seed utility is importable from scenarios and env | Yes - verified with `python -m pytest tests/test_scenarios.py` |
|
| 89 |
+
| SCN 02 | E03 | Person A | Define normalized scenario schema with task summary, success criteria, constraints, resources, allowed substitutions, and hidden reference spec | `replicalab/scenarios/templates.py` | 2026-03-08 | Added `NormalizedScenarioPack` plus strict `ScenarioConstraint`, `ScenarioResource`, `AllowedSubstitution`, and `HiddenReferenceSpec` models to standardize all scenario families. | All scenario builders return the same normalized top-level structure and mapper-ready inputs | Yes - verified with `python -m pytest tests/test_scenarios.py` |
|
| 90 |
+
| SCN 03 | E03 | Person A | Implement mathematics template | `replicalab/scenarios/math_reasoning.py` | 2026-03-08 | Added deterministic mathematics planning templates covering theorem, proof-goal, review, and time constraints. | Generated scenario passes structure and internal consistency tests | Yes - verified with `python -m pytest tests/test_scenarios.py` |
|
| 91 |
+
| SCN 04 | E03 | Person A | Implement ML benchmark template | `replicalab/scenarios/ml_benchmark.py` | 2026-03-08 | Added deterministic ML benchmark templates covering dataset, compute, time, and evaluation constraints. | Generated scenario passes structure and internal consistency tests | Yes - verified with `python -m pytest tests/test_scenarios.py` |
|
| 92 |
+
| SCN 05 | E03 | Person A | Implement finance and trading planning template | `replicalab/scenarios/finance_trading.py` | 2026-03-08 | Added deterministic offline finance and trading planning templates covering capital, drawdown, slippage, and backtest rules. | Generated scenario passes structure and internal consistency tests | Yes - verified with `python -m pytest tests/test_scenarios.py` |
|
| 93 |
+
| SCN 06 | E03 | Person A | Implement difficulty application for easy, medium, hard | `replicalab/scenarios/templates.py`, `tests/test_scenarios.py` | 2026-03-08 | Added mechanical difficulty scaling that adjusts budgets, time, staff, resource availability, and injected conflict constraints across easy, medium, and hard. | Difficulty visibly changes the normalized scenario pack in a meaningful way | Yes - verified with `python -m pytest tests/test_scenarios.py` |
|
| 94 |
+
| SCN 07 | E03 | Person A | Implement normalized constraint and resource generator | `replicalab/scenarios/templates.py`, `tests/test_scenarios.py` | 2026-03-08 | Added normalized constraint and resource mapping into role-specific observations with consistency checks for unique keys and non-contradictory generated packs. | No generated scenario contains contradictory constraints or resources | Yes - verified with `python -m pytest tests/test_scenarios.py` |
| SCN 08 | E03 | Person A | Implement hidden reference spec and allowed substitutions per template | `replicalab/scenarios/templates.py`, `tests/test_scenarios.py` | 2026-03-08 | Added per-template hidden reference specs and allowed substitutions so scoring and negotiation can distinguish fixed versus flexible elements deterministically. | Hidden reference clearly marks what is fixed versus flexible for deterministic scoring | Yes - verified with `python -m pytest tests/test_scenarios.py` |
| SCN 09 | E03 | Person A | Implement `generate_scenario(seed, template, difficulty)` | `replicalab/scenarios/templates.py`, `server/app.py`, `tests/test_scenarios.py` | 2026-03-08 | Added deterministic full-scenario generation and wired the stub server to use the normalized scenario families instead of the earlier hard-coded lab-only placeholder list. | Function returns a full scenario with deterministic content | Yes - verified with `python -m pytest tests/test_scenarios.py` and a `_StubEnv.reset(...)` smoke test |
| SCN 10 | E03 | Person A | Add seeded generation tests and consistency tests | `tests/test_scenarios.py` | 2026-03-08 | Added seeded determinism, variation, difficulty, consistency, and family-list tests for the normalized scenario engine. | Same seed plus template returns the same scenario and different seeds vary | Yes - verified with `python -m pytest tests/test_scenarios.py` |
| SCN 13 | E03 | Person A | Implement shared booking and scheduling data model for GPUs, rooms, or equipment with time slot conflicts and duration | `replicalab/scenarios/templates.py`, `replicalab/scenarios/__init__.py`, `tests/test_scenarios.py` | 2026-03-08 | Added typed `ResourceBooking` and `SchedulingWindow` models, extended `NormalizedScenarioPack` with deterministic booking and scheduling data, wired seeded generation into the scenario builder across all three domains, and added five scenario tests covering determinism, easy-mode no-conflict behavior, JSON round-trip, valid windows, and domain coverage. | Constraint generator can produce realistic booking conflicts across domains and the Lab Manager can check availability | Yes - verified with `python -m pytest tests/test_scenarios.py` (13 tests pass) and full suite (`304 passed`) |
| AGT 09 | E04 | Person A | Add deterministic feasibility checker tests for Lab Manager grounding | `tests/test_lab_manager_policy.py` | 2026-03-08 | Added seventeen deterministic regression tests covering `check_feasibility(...)`, `suggest_alternative(...)`, and `compose_lab_manager_response(...)` across all three domains, including repeated-run determinism, substitution ordering, duration and sample-size revision stability, never-worsens checks, action-type branching, flag mirroring, and explanation stability. | Same proposal plus same normalized scenario returns the same checker results every time | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py` (37 tests pass) and full suite (`304 passed`) |
| ENV 01 | E06 | Person A | Create `ReplicaLabEnv` class skeleton | `replicalab/env/replicalab_env.py`, `replicalab/env/__init__.py` | 2026-03-08 | Added a real `ReplicaLabEnv` module as a drop-in replacement for the former in-server stub, ported the working stub behavior into the environment package, wired scenario-pack-backed `reset`, `step`, `state`, and `close` methods with follow-on `TODO(ENV XX)` markers, and removed the old stub-only marker from `StepInfo` payloads. | Environment class imports and instantiates without runtime errors | Yes - verified with a direct `ReplicaLabEnv.reset(...) -> step(...) -> state() -> close()` smoke run and `python -m pytest` (`111 passed`) |
| JDG 01 | E05 | Person A | Implement rigor or objective-validity score | `replicalab/scoring/rigor.py`, `replicalab/utils/text.py`, `tests/test_reward.py` | 2026-03-08 | Added `score_rigor(protocol, scenario)` with weighted sub-scores for structural completeness (0.30), success criteria coverage (0.40), and required element coverage (0.30). Uses shared `element_tokens` from `replicalab/utils/text.py`. Five focused tests in `test_reward.py` cover quality ordering, determinism, controls impact, rationale length, and all-domain range validation. | Score is between 0 and 1, matches rubric examples, and rewards correct evidence-backed planning | Yes - verified with `python -m pytest tests/test_reward.py` (18 tests pass) |
| JDG 02 | E05 | Person A | Implement feasibility score | `replicalab/scoring/feasibility.py`, `tests/test_reward.py` | 2026-03-08 | Added `score_feasibility(protocol, scenario, check=None)` that derives a continuous [0,1] signal from `FeasibilityCheckResult` (AGT 05). Seven dimensions weighted equally (1/7) with partial credit for budget, equipment, reagents, and staff. Accepts optional pre-computed check to avoid redundant work. Six focused tests cover viable protocol, infeasible ordering, pre-computed check equivalence, determinism, partial credit, and all-domain range. | Score is between 0 and 1 and matches normalized constraint logic | Yes - verified with `python -m pytest tests/test_reward.py` (18 tests pass) |
| JDG 03 | E05 | Person A | Implement fidelity score | `replicalab/scoring/fidelity.py`, `tests/test_reward.py` | 2026-03-08 | Added `score_fidelity(protocol, scenario)` with substitution-aware scoring: required element coverage (0.50, direct match=1.0, substitution=0.7), flexible element alignment (0.20, bonus only), target metric alignment (0.20), and technique appropriateness (0.10). Five focused tests cover aligned vs misaligned ordering, determinism, substitution partial credit, target metric impact, and all-domain range. | Score is between 0 and 1 and matches rubric examples for plan and evidence alignment | Yes - verified with `python -m pytest tests/test_reward.py` (18 tests pass) |
| JDG 04 | E05 | Person A | Implement total reward formula | `replicalab/scoring/rubric.py`, `tests/test_reward.py` | 2026-03-07 | `compute_total_reward(breakdown)` implements `10 × rigor × feasibility × fidelity + bonuses − penalties` with `max(0.0, ...)` floor clamp. Eight new tests in `test_reward.py` verify perfect-vs-broken ordering, zero-feasibility behavior, efficiency bonus ordering, exact penalty subtraction, zero-clamp floor, determinism, external penalties injection, and default-empty penalties. Seven existing rubric tests in `test_env.py` also cover the formula. | Total reward formula matches agreed math, clamps at zero, and returns consistent output for plan quality and bounded tool behavior | Yes - verified with `python -m pytest tests/test_reward.py` (26 tests pass) and `python -m pytest tests/test_env.py` (36 tests pass) |
| JDG 05 | E05 | Person A | Build reward breakdown object | `replicalab/scoring/rubric.py`, `replicalab/scoring/__init__.py`, `tests/test_reward.py` | 2026-03-07 | `build_reward_breakdown(...)` accepts an optional `penalties: dict[str, float]` parameter for named penalty keys (e.g. `invalid_tool_use`, `unsupported_claim`) from bounded-tool diagnostics without reopening the model contract. Returns a typed `RewardBreakdown` with rigor, feasibility, fidelity, efficiency_bonus, communication_bonus, and penalties dict. Exported through `replicalab.scoring`. | Breakdown includes rigor, feasibility, fidelity, bonuses, penalties, and bounded tool diagnostics extension point | Yes - verified with `python -m pytest tests/test_reward.py` (26 tests pass) and `python -m pytest tests/test_env.py` (36 tests pass) |
| JDG 06 | E05 | Person A | Add optional plain English explanation function from reward breakdown | `replicalab/scoring/explain.py`, `replicalab/scoring/__init__.py`, `tests/test_reward.py` | 2026-03-08 | Added `explain_reward(...)`, a deterministic explanation builder that mirrors rigor, feasibility, fidelity, bonuses, penalties, and total reward with stable quality-tier labels and without introducing any new scoring logic. Exported through `replicalab.scoring` and covered by nine focused tests. | Explanation mirrors rubric, may reference bounded evidence or tool outcomes, and introduces no new hidden logic | Yes - verified with `python -m pytest tests/test_reward.py` (40 tests pass) |
| JDG 08 | E05 | Person A | Add score determinism tests and edge case tests | `tests/test_reward.py` | 2026-03-08 | Added six focused regression tests covering good-vs-awful ordering across all judge axes and total reward, success-criteria sensitivity in rigor scoring, partial equipment credit ordering in feasibility scoring, direct-match vs allowed-substitution vs miss ordering in fidelity scoring, and reward-breakdown determinism with and without a precomputed feasibility check. | Perfect and broken protocols produce expected relative ordering and scoring remains deterministic across edge cases | Yes - verified with `python -m pytest tests/test_reward.py` (40 tests pass) and `python -m pytest -q` (264 passed) |
| JDG 11 | E05 | Person A | Add structured final audit payload with judge_notes, verdict, and top failure reasons | `replicalab/agents/judge_policy.py`, `replicalab/agents/__init__.py`, `tests/test_judge_policy.py` | 2026-03-08 | Created `JudgeAudit` model and `build_judge_audit()` builder that derives verdict (`accept`/`timeout`/`no_agreement`), reuses `explain_reward()` for `judge_notes`, and extracts top failure reasons from weak rubric components and penalty keys. Exported through `replicalab.agents`. Ten tests cover all three verdict paths, component-driven failure reasons, penalty surfacing, reason cap, good-protocol empty reasons, determinism, and JSON round-trip. | Final judgement output is deterministic, human readable, and consumable by env, API, logs, and UI | Yes - verified with `python -m pytest tests/test_judge_policy.py` (10 tests pass) and full suite (255 passed) |
| ENV 02 | E06 | Person A | Implement real reset wiring | `replicalab/env/replicalab_env.py` | 2026-03-08 | `_make_observation()` now uses the scenario pack as source of truth for booked/out-of-stock/safety data instead of empty placeholders. Eight reset tests verify both roles populated, booked/out-of-stock preserved, all templates and difficulties. | Reset returns initial observations with full scenario data | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
| ENV 03 | E06 | Person A | Implement Scientist turn with validation | `replicalab/env/replicalab_env.py` | 2026-03-08 | Added `_validate_scientist_action()` that runs `validate_protocol()` on proposals and returns structured error strings without crashing the env. Invalid actions don't advance the round. | Valid action updates state, invalid action returns structured error | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
| ENV 04 | E06 | Person A | Implement Lab Manager response step | `replicalab/env/replicalab_env.py` | 2026-03-08 | `_lab_manager_action()` uses the full grounded pipeline: `check_feasibility()` → `suggest_alternative()` → `compose_lab_manager_response()`. | Lab Manager response is grounded in feasibility check results | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
| ENV 05 | E06 | Person A | Centralize termination logic | `replicalab/env/replicalab_env.py` | 2026-03-08 | Added `_check_termination()`: Scientist accept with existing protocol OR max_rounds. Lab Manager accept does NOT auto-terminate. | Episode terminates on agreement or round limit | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
| ENV 06 | E06 | Person A | Wire real judge scoring | `replicalab/env/replicalab_env.py`, `tests/test_env.py` | 2026-03-07 | Terminal accept steps call `build_reward_breakdown()` and `compute_total_reward()` with real rigor/feasibility/fidelity scores stored in `EpisodeState`. Terminal-without-agreement path now distinguishes `timeout` (max rounds) from `no_agreement` verdict. Four new tests in `TestEnvReward` verify agreement-terminal breakdown/notes/verdict, no-agreement determinism, timeout verdict, and state-stored component scores. | Final step returns total reward, breakdown info, and deterministic penalties or bonuses; verdict distinguishes timeout from no_agreement | Yes - verified with `python -m pytest tests/test_env.py` (36 tests pass) and `python -m pytest` (178 tests pass) |
| ENV 07 | E06 | Person A | Implement state() deep snapshot | `replicalab/env/replicalab_env.py` | 2026-03-08 | `state()` now returns `self._state.model_copy(deep=True)` so callers get an independent snapshot. Two tests verify mutation isolation. | State snapshot is independent of env internals | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
| ENV 08 | E06 | Person A | Implement close() with lifecycle guard | `replicalab/env/replicalab_env.py` | 2026-03-08 | Added `_closed` flag, idempotent `close()`, `_ensure_open()` guard on `step()`, and `reset()` reopens a closed env. Three tests verify idempotency, step-after-close raises, and reset-reopens. | Close frees resources and does not throw; step after close raises | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
| ENV 10 | E06 | Person A | Add reset, step, invalid action, timeout, and deterministic replay tests | `tests/test_env.py` | 2026-03-08 | Added a dedicated replay-determinism regression block that verifies same seed plus same actions yields the same initial observation, step trajectory, timeout terminal path, invalid-action behavior, and audit payload across math, ML, and finance families. The new coverage keeps replay deterministic without depending on file-backed logging. | Tests pass for seeded reset, valid step, invalid step, timeout, and replay consistency across supported scenario families | Yes - verified with `python -m pytest tests/test_env.py` (56 tests pass) and full suite (`327 passed`) |
| ENV 11 | E06 | Person A | Attach judge audit payload to final `StepResult`, terminal observations, and replay state | `replicalab/models.py`, `replicalab/env/replicalab_env.py`, `server/app.py`, `tests/test_env.py`, `tests/test_server.py` | 2026-03-08 | Added `top_failure_reasons` to `StepInfo`, `EpisodeState`, and `EpisodeLog`; terminal env steps now build a canonical audit via `build_judge_audit(...)`; and replay log construction now persists `top_failure_reasons` from terminal `StepResult.info` instead of dropping them. Seven env tests cover terminal audit behavior and a replay test verifies the audit reasons survive into `GET /replay/{episode_id}` payloads. | Completed episodes expose audit notes alongside reward breakdown in a stable schema across env state and replay | Yes - verified with `python -m pytest tests/test_env.py` (43 tests pass), `python -m pytest tests/test_server.py` (37 tests pass), and full suite (`314 passed`) |
| OBS 04 | E10 | Person A | Add deterministic replay test using seed and action sequence | `tests/test_env.py` | 2026-03-08 | Closed the observability-side replay guard by reusing the new seeded replay-determinism suite in `TestReplayDeterminism`, which verifies same-seed same-action trajectories, timeout replay determinism, invalid-action replay determinism, and stable terminal audit payloads across all three scenario families. | Replay of the same seed and action sequence matches the prior state sequence deterministically | Yes - verified with `python -m pytest tests/test_env.py` (56 tests pass) and full suite (`327 passed`) |
| TST 01 | E11 | Person A | Add reset returns valid observations test | `tests/test_env.py` | 2026-03-08 | Eight tests in `TestReset` class covering both roles populated, scientist fields, lab manager fields, booked/out-of-stock preservation, state round zero, episode ID, clearing previous episode, and all templates/difficulties. | Test confirms both roles receive valid structured observations | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
| TST 02 | E11 | Person A | Add valid action step test | `tests/test_env.py` | 2026-03-08 | Eight tests in `TestStep` class covering round advancement, observation shape, conversation history, accept termination, real reward scores, max round termination, step info fields, and full propose-then-accept episode. | Valid action advances round and returns correct shape | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
| TST 03 | E11 | Person A | Add invalid action handling test | `tests/test_env.py` | 2026-03-08 | Four tests in `TestInvalidAction` class covering error string on invalid duration, env survival after error, no round advancement on invalid action, and request_info always passes. | Invalid action yields structured error and env survives | Yes - verified with `python -m pytest tests/test_env.py` (32 tests pass) |
| TST 04 | E11 | Person A | Add perfect protocol high reward test | `tests/test_reward.py` | 2026-03-08 | Added reward-regression coverage proving a fully aligned protocol scores higher than a broken baseline and stays ordered consistently across reruns. | Perfect protocol scores higher than baseline and broken protocol | Yes - verified with `python -m pytest tests/test_reward.py` (26 tests pass) |
| TST 05 | E11 | Person A | Add zero dimension or penalty behavior test | `tests/test_reward.py` | 2026-03-08 | Added reward-regression coverage for zero-feasibility collapse, exact penalty subtraction, and zero-floor clamp behavior so timeout and penalty paths lower reward deterministically. | Zero feasibility or timeout lowers reward as expected | Yes - verified with `python -m pytest tests/test_reward.py` (26 tests pass) |
| MOD 08 | E02 | Person A | Write unit tests for schemas and validators | `tests/test_mod08_schemas.py` | 2026-03-08 | Created 70 comprehensive unit tests covering all Pydantic model edge cases: ScientistAction (15 tests for all action types, mixed-mode rejection, whitespace stripping, empty/negative field rejection), LabManagerAction (11 tests for all action types, feasible-flag consistency, suggestion-field rules), Protocol (10 tests for boundary values, stripping, extra-field rejection), ConversationEntry (7 tests for null/empty action_type, role validation), RewardBreakdown (9 tests for boundary values, range rejection), Observation (4 tests for both-none, single-role), LabManagerObservation (3 tests for negative fields, stripping), StepInfo (3 tests for extra-field allowance), StepResult (3 tests), EpisodeState (2 tests), EpisodeLog (3 tests for failure reasons, model_dump keys). | Tests cover valid parse, invalid parse, and replay serialization | Yes - verified with `python -m pytest tests/test_mod08_schemas.py -v` (70 passed) and full suite (409 passed) |
| API 03 | E07 | Person C | Add `POST /step` endpoint | `server/app.py`, `tests/test_server.py` | 2026-03-07 | Fixed `_build_episode_log()` to take the real `StepResult` instead of rebuilding reward data from state with stale stub values. Both REST `/step` and WebSocket step handler now pass the terminal `StepResult` to the updated helper so replay logs use real `reward_breakdown`, `judge_notes`, and `verdict` (including `timeout` vs `no_agreement`). Added five endpoint tests covering reset-then-step happy path, invalid session ID 404, terminal step with real reward breakdown, semantic invalid action returning 200 with `info.error`, and replay with real judge data. | Step endpoint accepts valid action and returns step result | Yes - verified with `python -m pytest tests/test_server.py` (10 tests pass) and `python -m pytest` (183 tests pass) |
| API 06 | E07 | Person C | Add WebSocket session handler with isolated env per connection | `server/app.py`, `tests/test_server.py` | 2026-03-07 | WebSocket handler at `/ws` supports `reset`, `step`, and `ping` message types with per-connection env isolation, idle timeout, and replay storage on terminal episodes. Twelve WebSocket tests cover ping-pong, reset observation, step result, full episode real reward, invalid JSON, missing action field, invalid action payload, unknown message type, session isolation, semantic invalid action returning `step_ok` with `info.error`, timeout verdict proving real-env integration, and terminal episode replay persistence via `GET /replay/{episode_id}`. | WebSocket session handler supports reset, step, ping with isolated env per connection and correct replay storage | Yes - verified with `python -m pytest tests/test_server.py` (22 tests pass) and `python -m pytest` (195 tests pass) |
| TST 07 | E11 | Person C | Add WebSocket session handler tests | `tests/test_server.py` | 2026-03-07 | Twelve focused WebSocket tests covering connectivity, message handling, error paths, session isolation, semantic-vs-transport error distinction, timeout verdict, and replay log persistence with real judge data. Tests verify that structurally valid but semantically invalid actions return `step_ok` with `info.error` (not WS error frames), matching the env contract. | WebSocket tests cover happy path, error handling, session isolation, and real-env integration | Yes - verified with `python -m pytest tests/test_server.py` (22 tests pass) |
| API 02 | E07 | Person C | Add `POST /reset` endpoint | `server/app.py`, `tests/test_server.py` | 2026-03-08 | `/reset` endpoint creates a new env (or closes the prior one when reusing `session_id`), calls `env.reset(...)`, persists env, `last_active`, and `episode_id` in the in-memory REST session store, and returns `session_id`, `episode_id`, `observation`. Seven dedicated tests cover response shape, both-role observation, explicit session_id reuse, prior-env close on reuse, default params, all scenario/difficulty combos, and seed determinism. | Reset endpoint starts a new episode and returns initial observation | Yes - verified with `python -m pytest tests/test_server.py` (29 tests pass) and `python -m pytest` (202 tests pass) |
| API 04 | E07 | Person C | Add `GET /scenarios` endpoint | `server/app.py`, `tests/test_server.py` | 2026-03-08 | `GET /scenarios` returns the `available_scenario_families()` output through the typed `ScenariosResponse` model. Five focused tests cover status code, response shape, all three scenario families, the expected `easy`, `medium`, and `hard` difficulties, and the absence of extra keys. | Endpoint lists available scenario families and difficulties | Yes - verified with `python -m pytest tests/test_server.py -v` (34 tests pass) |
| API 07 | E07 | Person C | Add idle timeout and graceful disconnect cleanup | `server/app.py`, `tests/test_server.py` | 2026-03-08 | Verified the existing WebSocket idle-timeout and disconnect cleanup path with two focused tests: one monkeypatches the idle timeout to 0.5s and confirms the server closes with code 1000 when no message arrives, and one wraps `_make_env()` to confirm `env.close()` is called exactly once from the `finally` block on disconnect. | Stale connections close cleanly and the environment closes without leak | Yes - verified with `python -m pytest tests/test_server.py -v` (34 tests pass) |
| API 13 | E07 | Person C | Add CORS middleware configuration for frontend origins in dev and production | `server/app.py`, `tests/test_server.py` | 2026-03-08 | Confirmed the existing FastAPI CORS middleware allows the local Vite frontend origin plus `https://*.hf.space`, and added three explicit preflight tests covering localhost allowance, HF Space allowance, and disallowed-origin rejection. | Frontend on localhost:5173 and HF Space origin can reach the API without CORS errors | Yes - verified with `python -m pytest tests/test_server.py -v` (34 tests pass) |
| API 08 | E07 | Person C | Build Dockerfile with Python app startup on port 7860 | `server/Dockerfile`, `Dockerfile`, `server/requirements.txt`, `docs/max/deployment.md` | 2026-03-08 | Fixed editable install (`-e .` → `. --no-deps`) in both `server/Dockerfile` and root `Dockerfile`, added `httpx` and `websocket-client` to `server/requirements.txt` (required by `replicalab.client`), rebuilt without cache. Verified Docker container starts with the **real env** (`"env":"real"`), and all four endpoints work: `GET /health`, `GET /scenarios`, `POST /reset`, `POST /step`. Added verified endpoint commands to `docs/max/deployment.md`. | Local Docker run serves app on port 7860 | Yes - verified with `docker build -f server/Dockerfile -t replicalab . && docker run -p 7860:7860 replicalab` and curl against all four endpoints |
| API 09 | E07 | Person C | Add Hugging Face Space metadata and deploy instructions | `README.md`, `Dockerfile`, `docs/max/deployment.md` | 2026-03-08 | Added the Hugging Face Spaces YAML frontmatter to the root README, created the root-level `Dockerfile` required by the Docker SDK, and documented Space creation, git remote setup, push, logs, and secret management in `docs/max/deployment.md`. | Space config is valid for Docker app deployment | Yes - verified against HF Spaces Docker deployment requirements |
| API 15 | E07 | Person C | Create HF Space README.md with YAML frontmatter | `README.md` | 2026-03-08 | Added the required Spaces frontmatter fields (`sdk: docker`, `app_port: 7860`, title, emoji, colors, pinned) to the root README so Hugging Face parses the Space metadata correctly on push. | HF Space config is valid and Space launches correctly from the metadata | Yes - verified against the HF Spaces frontmatter schema |
| API 14 | E07 | Person C | Add REST session management so each user gets isolated environment state | `tests/test_api_rest_isolation.py` | 2026-03-08 | Created 11 dedicated REST session isolation tests in a standalone file covering: two resets produce different sessions, independent observations across scenarios, stepping one session does not mutate the other, independent round counts, terminal isolation, session_id reuse creates new episode and resets rounds, reuse does not affect other sessions, 404 on nonexistent session, step-after-terminal behavior, and replay isolation between sessions. No server changes needed — isolation already works correctly. | Two concurrent REST users do not share or corrupt each other's episode state | Yes - verified with `python -m pytest tests/test_api_rest_isolation.py` (11 tests pass) and full suite (307 passed) |
| API 10 | E07 | Person C | Deploy live Space and verify health, reset, and step | `docs/max/deployment.md`, `README.md` | 2026-03-08 | Verified the live HF Space at `https://ayushozha-replicalab.hf.space` with all four endpoints: `GET /health` (200, env=real), `GET /scenarios` (200, 3 families), `POST /reset` (200, returns session_id/episode_id/observation), `POST /step` (200, returns reward/done/info). Ran a full episode (propose → accept) with real judge scoring: rigor=0.465, feasibility=1.000, fidelity=0.325, total_reward=2.313, verdict=accept. Updated deployment docs and README with verified live URL. | Live Space responds successfully and one end-to-end episode works on the hosted env | Yes - verified with `httpx` requests against `https://ayushozha-replicalab.hf.space` |
| API 17 | E07 | Person C | Document secrets and API key management for hosted deployment and Colab | `docs/max/deployment.md` | 2026-03-08 | Documented that the server is fully self-contained with no external API calls or secrets required. Added secrets reference table for all four contexts (HF Space, local dev, Docker, Colab notebook) with `HF_TOKEN` for model downloads and `REPLICALAB_URL` for hosted env. Documented Colab Secrets panel setup. Added future secrets section for an optional hosted evaluator. | Secrets setup is documented clearly enough for another teammate to reproduce | Yes - verified by inspecting `server/app.py` for env var references (none found) and documenting the complete secrets landscape |
| JDG 07 | E05 | Person C | Log reward breakdown to CSV or JSONL per episode | `replicalab/utils/logging.py`, `tests/test_logging.py` | 2026-03-08 | Verified existing implementation: `append_reward_csv()` writes per-episode rows with all V2 columns (parsimony, bonuses, penalty total, verdict), `append_reward_jsonl()` preserves nested penalty dicts and bounded-tool metrics, and `log_episode_reward()` writes to both formats. 22 tests in `tests/test_logging.py` cover CSV creation, header dedup, JSONL records, default breakdowns, nested penalty preservation, determinism, and the convenience wrapper. No code changes needed. | Reward file contains seed, scenario, score components, total reward, rounds, agreement, and bounded tool metrics | Yes - verified with `python -m pytest tests/test_logging.py -v` (22 passed) and full suite (409 passed) |
| API 01 | E07 | Person C | Create FastAPI app shell and health endpoint | `server/app.py` | 2026-03-08 | Verified the FastAPI app shell is fully functional: `GET /health` returns 200 with `{"status":"ok","env":"real"}`, the app imports and wires `ReplicaLabEnv`, logging is configured via env vars, CORS middleware is active, and all downstream endpoints (reset, step, scenarios, replay, WebSocket) are operational. Server endpoint tests in `tests/test_server.py` (34 tests) and REST isolation tests (11 tests) confirm full coverage. No code changes needed — task was already complete beyond its acceptance criteria. | `GET /health` returns 200 with simple payload | Yes - verified with existing tests and full suite (409 passed) |
| OBS 02 | E10 | Person C | Add local log levels and readable console formatting | `replicalab/config.py`, `server/app.py` | 2026-03-08 | Verified logging already meets all acceptance criteria: `REPLICALAB_LOG_LEVEL` env var toggles log verbosity without code edits (default INFO), `LOG_FORMAT` provides readable `%(asctime)s [%(levelname)s] %(name)s: %(message)s` layout, and `server/app.py` wires both via `logging.basicConfig()`. No code changes needed. | Debug logs can be toggled without code edits | Yes - verified by reading `replicalab/config.py` (lines 30-31) and `server/app.py` (lines 75-79) |
| ENV 09 | E06 | Person C | Write episode logs on completion | `server/app.py` | 2026-03-08 | Added `write_episode_log()` and `log_episode_reward()` calls to `server/app.py` in both REST `/step` and WebSocket step handlers. Terminal episodes now auto-persist replay JSON and reward CSV/JSONL to disk. | Completed episodes generate replayable logs automatically | Yes - verified with terminal episode persistence through REST and WebSocket paths |
| OBS 09 | E10 | Person C | Extend episode summary with audit metadata | `replicalab/models.py` | 2026-03-08 | Added `invalid_action_count` (int) and `invalid_action_rate` (float) fields to `EpisodeLog` in `replicalab/models.py`. Server tracks invalid actions per session and per WebSocket connection. | Every completed episode log contains the audit payload plus demo and evaluation metrics | Yes - verified with model field presence and server-side tracking |
| OBS 07 | E10 | Person C | Script to run one episode and dump logs | `scripts/run_episode.py` | 2026-03-08 | Created `scripts/run_episode.py` that resets the env, runs a baseline propose-then-accept episode, and writes replay JSON plus reward CSV/JSONL. | One command produces a complete local sample log | Yes - verified with script execution producing replay and reward files |
| TST 11 | E11 | Person C | Judge audit payload contract tests | `tests/test_audit_contract.py` | 2026-03-08 | Created `tests/test_audit_contract.py` with 17 tests across 3 classes: `StepInfoAuditContract` (6 tests), `EpisodeLogAuditContract` (6 tests), `AuditModelContracts` (5 tests). | Tests confirm terminal payloads and replay files expose audit notes, agreement, and invalid action metrics | Yes - verified with `python -m pytest tests/test_audit_contract.py` (17 tests pass) |
| API 05 | E07 | Person C | Add `GET /replay/{episode_id}` endpoint | `server/app.py` | 2026-03-08 | Already implemented at `server/app.py` line 536-540. Endpoint returns completed episode log JSON for a valid episode ID. | Endpoint returns completed log for valid episode id | Yes - verified with existing replay endpoint tests |
| API 11 | E07 | Person C | Add server endpoint tests and WebSocket smoke test | `tests/test_server.py` | 2026-03-08 | Already implemented in `tests/test_server.py` with 44 tests covering health, reset, step, scenarios, replay, WebSocket connectivity, error handling, session isolation, and smoke paths. | Local server tests pass for health, reset, step, invalid payload, and ws connect | Yes - verified with `python -m pytest tests/test_server.py` (44 tests pass) |
| API 18 | E07 | Person C | Include judge audit payload in terminal responses | `server/app.py` | 2026-03-08 | Already implemented. Terminal `StepInfo` includes `judge_notes`, `verdict`, and `top_failure_reasons` from the real judge audit in both REST and WebSocket paths. | Clients receive judge_notes, verdict fields, and bounded tool audit data without separate log file access | Yes - verified with terminal response inspection and audit contract tests |
| OBS 01 | E10 | Person C | Standardize episode log schema | `replicalab/models.py` | 2026-03-08 | Already implemented. `EpisodeLog` model in `replicalab/models.py` is the canonical schema with all required fields for transcript, state snapshots, scores, and audit metadata. | Every completed episode log contains the same required fields | Yes - verified with `EpisodeLog` model inspection and schema tests |
| OBS 03 | E10 | Person C | Episode id generation and file naming conventions | `replicalab/utils/logging.py` | 2026-03-08 | Already implemented. UUID generation in env, `{episode_id}.json` naming in `replicalab/utils/logging.py`. Logs never overwrite because each episode gets a unique UUID. | Logs never overwrite and are easy to locate | Yes - verified with replay file naming behavior |
| TST 06 | E11 | Person C | Health plus reset plus step endpoint tests | `tests/test_server.py` | 2026-03-08 | Already implemented in `tests/test_server.py` with `TestHealthEndpoint`, `TestResetEndpoint`, and `TestStepEndpoint` classes. | API tests pass locally | Yes - verified with `python -m pytest tests/test_server.py` |
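The OBS 02 row above describes env-driven log configuration. A minimal sketch of that wiring, assuming only the documented `REPLICALAB_LOG_LEVEL` variable and format string (the helper name `configure_logging` is illustrative; `server/app.py` wires this via `logging.basicConfig()`):

```python
import logging
import os

# Mirrors the documented behavior: REPLICALAB_LOG_LEVEL toggles verbosity
# without code edits (INFO by default), and LOG_FORMAT keeps console
# lines readable.
LOG_FORMAT = "%(asctime)s [%(levelname)s] %(name)s: %(message)s"

def configure_logging() -> int:
    # Unknown level names fall back to INFO rather than raising.
    level_name = os.getenv("REPLICALAB_LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, logging.INFO)
    logging.basicConfig(level=level, format=LOG_FORMAT)
    return level
```

Setting `REPLICALAB_LOG_LEVEL=DEBUG` before startup is then enough to get verbose logs from the same build.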
### Person B (Ayush) - Completed own tasks

| ID | Epic | Task | File/Module | Date | What Was Done | Acceptance Criteria | Verified |
|----|------|------|-------------|------|---------------|--------------------|---------|
| MOD 09 | E02 | Add output parser that maps model text to `ScientistAction` | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added a raw-text parser that extracts JSON from plain output, fenced blocks, or prose-wrapped objects, validates it into `ScientistAction`, and raises explicit `ScientistOutputParseError` values for missing JSON, invalid JSON, or schema failures. | Parser returns structured action or explicit parse error | Yes - verified with `python -m pytest tests/test_scientist_policy.py tests/test_models.py` and a direct `parse_scientist_output(...)` smoke check |
| SCN 11 | E03 | Create hand checked golden scenarios for prompt testing | `tests/fixtures/golden_scenarios.json`, `tests/test_scenarios.py` | 2026-03-08 | Added three deterministic golden scenarios for math, ML, and finance prompt checks plus fixture-validation tests. | Three fixed scenarios are available for deterministic manual testing | Yes - verified with `python -m pytest tests/test_scenarios.py` |
| AGT 01 | E04 | Draft domain-neutral system prompt for Scientist role from normalized scenario data | `replicalab/agents/scientist_policy.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `build_scientist_system_prompt(...)` to render role guidance, success criteria, mapped constraints, mapped resources, substitutions, and the strict JSON contract from normalized scenario data. | Prompt clearly explains role, mapped constraints, and JSON output contract | Yes - verified with `python -m pytest tests/test_scientist_policy.py` and a direct prompt-build smoke check |
| AGT 02 | E04 | Build observation to prompt formatting helper from normalized scenario-derived observations | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `format_scientist_observation(...)` to render round status, paper context, conversation history, current protocol, and the next-action instruction in a fixed deterministic order, and exported it through the agent package. | Formatted prompt includes task info, history, and action schema consistently | Yes - verified with `python -m pytest tests/test_scientist_policy.py` |
| AGT 04 | E04 | Build baseline heuristic Scientist for non trained smoke tests | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added `build_baseline_scientist_action(...)`, a deterministic baseline Scientist policy that proposes a protocol on the first turn, revises only when the latest Lab Manager feedback contains an obvious blocker, and otherwise accepts the current protocol so smoke episodes can finish cleanly. | Baseline can complete episodes without crashing | Yes - verified with `python -m pytest tests/test_scientist_policy.py` including a stub-env episode smoke test |
| AGT 05 | E04 | Implement deterministic feasibility checker over normalized constraints and resources | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added a deterministic Lab Manager feasibility checker with a typed `FeasibilityCheckResult`, explicit per-dimension protocol, budget, equipment, reagents, schedule, staff, and policy checks, substitution reporting, and stable summary output. | Checker returns clear pass or fail per constraint dimension | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py tests/test_validation.py tests/test_scientist_policy.py` |
| AGT 06 | E04 | Implement alternative suggestion logic from allowed substitutions and resource tradeoffs | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added deterministic alternative-suggestion logic that applies substitutions, duration clamps, and sample-size reductions in fixed order, re-runs feasibility after the revision, and returns a typed `AlternativeSuggestion` with applied changes, remaining failures, and pre or post feasibility checks. | Lab Manager can suggest at least one sensible revision when the initial plan fails | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py` |
| AGT 07 | E04 | Add grounded Lab Manager response synthesis from feasibility results and suggested revisions | `replicalab/agents/lab_manager_policy.py`, `replicalab/agents/__init__.py`, `server/app.py`, `tests/test_lab_manager_policy.py` | 2026-03-08 | Added `compose_lab_manager_response(...)`, a deterministic outward-action composer that converts feasibility plus alternative-suggestion results into a typed `LabManagerAction` with stable flags, readable explanations, and optional injected explanation rendering, then wired the stub server to log those grounded responses instead of placeholder text. | Output is readable, grounded in checker results, and maps cleanly to underlying checks | Yes - verified with `python -m pytest tests/test_lab_manager_policy.py` and a stub-env step smoke check |
| AGT 11 | E04 | Select and document base model for Scientist training | `docs/agt11_scientist_model_selection.md`, `README.md`, `docs/training_goals.md` | 2026-03-08 | Updated the active model decision to use `Qwen/Qwen3.5-9B` as the shared Scientist and Lab Manager base for Northflank H100 runs, with `Qwen/Qwen3.5-4B` as fallback and `Qwen/Qwen3.5-122B-A10B` documented as an audit-only judge candidate. | Decision is recorded and all team members know which model family is being fine tuned | Yes - verified by the decision record, training-goals doc, and README update |
| AGT 10 | E04 | Write prompt text files for all three roles with bounded tool rules | `replicalab/prompts/__init__.py`, `replicalab/prompts/scientist.txt`, `replicalab/prompts/lab_manager.txt`, `replicalab/prompts/judge.txt`, `tests/test_prompts.py` | 2026-03-08 | Added loadable prompt templates and rendering helpers for Scientist, Lab Manager, and Judge, each grounded in normalized scenario data and explicit bounded-tool rules for `search_evidence`, `run_code_check`, and `inspect_image`. Six prompt tests verify loadability, placeholder rendering, domain neutrality, and role-specific bounded-tool guidance. | Prompt files exist, are loadable, encode bounded tool rules clearly, and assemble correctly from normalized scenario data and agreed role behavior | Yes - verified with `python -m pytest tests/test_prompts.py` (6 tests pass) and full suite (`304 passed`) |
| AGT 03 | E04 | Add parse plus retry strategy for malformed model output | `replicalab/agents/scientist_policy.py`, `tests/test_scientist_policy.py` | 2026-03-07 | Added `call_scientist_with_retry(...)` with error-specific correction prompts, bounded retry loop, and exposed `RetryMetadata` telemetry. Seven focused tests cover first-try success, malformed-then-valid, invalid-then-valid, exhaustion, correction message content, and metadata serialization. | Malformed output triggers at least one controlled retry or explicit failure | Yes - verified with `python -m pytest tests/test_scientist_policy.py` (7 retry tests pass) |
| AGT 08 | E04 | Add prompt formatting, parse, and bounded-tool policy tests for Scientist policy | `replicalab/agents/scientist_policy.py`, `tests/test_scientist_policy.py` | 2026-03-07 | Added bounded-tool policy block to `build_scientist_system_prompt(...)` naming `search_evidence`, `run_code_check`, and `inspect_image` with explicit rules. Added 24 new tests covering parser happy paths (propose, accept, prose-wrapped), parser edge cases (empty, whitespace, list, extra keys, `to_dict()`), system prompt across all 3 domains plus dict coercion, bounded-tool policy assertions across all domains, role-boundary and output-contract assertions, formatter edge cases (final round, empty-list protocol), and baseline domain inference and forced-accept behavior. | Tests cover happy path, malformed output handling, and stable tool-policy reminders | Yes - verified with `python -m pytest tests/test_scientist_policy.py` (46 tests pass) and `python -m pytest tests/` (111 tests pass) |
| TRN 13 | E08 | Create reusable environment client module | `replicalab/client.py`, `tests/test_client.py` | 2026-03-08 | Added `ReplicaLabClient` with dual transport support (REST via `httpx`, WebSocket via `websocket-client`), unified sync interface (`connect`, `reset`, `step`, `state`, `close`), context manager support, internal session ID tracking, typed returns mapped to Pydantic models, and constructor-level transport selection. Twenty-four tests cover both transports: connect, reset, step, full episode, replay, context manager, error paths, semantic invalid action handling, and constructor validation. | Client module can be imported by notebook and other consumers without duplicating connection logic | Yes - verified with `python -m pytest tests/test_client.py` (24 tests pass) and `python -m pytest` (231 tests pass) |
| TRN 03 | E08 | Implement env client wrapper for training rollouts | `replicalab/training/rollout.py`, `replicalab/training/__init__.py`, `tests/test_rollout.py` | 2026-03-08 | Added `RolloutWorker` that wraps `ReplicaLabClient` to run full episodes via a user-supplied `PolicyFn` callback, collects typed `StepRecord` trajectories with observations, actions, and errors, and surfaces terminal `EpisodeRecord` with `total_reward`, `reward_breakdown`, `judge_notes`, `verdict`, and `agreement_reached`. Twelve tests cover baseline rollout completion, reward/breakdown/judge output, determinism, all 3 scenario families, metadata capture, max_steps safety cap, and validation error surfacing. | One local episode can be run start-to-finish through the wrapper with no duplicated HTTP/WS code | Yes - verified with `python -m pytest tests/test_rollout.py` (12 tests pass) and `python -m pytest` (264 tests pass) |
| TRN 04 | E08 | Implement rollout collection loop for Scientist episodes | `replicalab/training/rollout.py`, `replicalab/training/__init__.py`, `tests/test_rollout.py`, `tests/test_rollout_traces.py` | 2026-03-08 | Extended the rollout worker to collect full trajectory records with terminal `StepInfo`, bounded tool traces, and batched rollout support via `collect_rollouts(...)`. Added trace-focused tests that verify tool-trace capture from `StepInfo` extras and one-record-per-seed batch collection. | Loop collects trajectories, rewards, done signals, and bounded tool traces from frozen evidence packs | Yes - verified with `python -m pytest tests/test_rollout.py tests/test_rollout_traces.py` (14 tests pass) and full suite (`304 passed`) |
| TRN 01 | E08 | Create notebook skeleton | `notebooks/train_colab.ipynb` | 2026-03-08 | Added a judged-path training notebook with explicit setup, evidence preview, Scientist plan preview, Lab Manager plan preview, gated real-training cell, baseline evaluation cell, and Northflank runtime notes so the flow is readable without hiding logic in notebook-only cells. | Notebook has clear runnable sections in the right order and documents the bounded-tool policy | Yes - verified with notebook JSON load, preview-plan execution, and `python -m pytest tests/test_training_cli.py` |
| TRN 02 | E08 | Add package install and model setup cell | `notebooks/train_colab.ipynb`, `replicalab/training/runtime.py`, `pyproject.toml` | 2026-03-08 | Added a fresh-runtime install cell that installs the repo plus `unsloth`, `unsloth_zoo`, `trl`, `vllm`, `datasets`, and `matplotlib`, then added runtime helpers and the `replicalab-train` entrypoint so the same model-loading path works in notebooks and Northflank jobs. | Notebook installs dependencies without manual edits beyond secrets | Yes - verified with notebook inspection and `python -m pytest tests/test_training_cli.py` |
| TRN 14 | E08 | Select and document base model (notebook side) | `docs/agt11_scientist_model_selection.md`, `README.md`, `notebooks/train_colab.ipynb`, `docs/training_goals.md` | 2026-03-08 | Updated the active model decision to `Qwen/Qwen3.5-9B` as the primary shared base for Scientist GRPO and Lab Manager SFT on Northflank H100, kept `Qwen/Qwen3.5-4B` as the reduced-scale fallback, and documented `Qwen/Qwen3.5-122B-A10B` as an audit-only judge candidate. | Base model choice is documented and all team members know which model family is being trained | Yes - verified by the decision record and README update; notebook defaults remain the smaller sponsor-facing path where appropriate |
| JDG 10 | E05 | Expose component metrics for training plots | `replicalab/training/metrics.py`, `replicalab/training/plots.py`, `replicalab/training/cli.py`, `tests/test_training_metrics.py`, `docs/training_goals.md` | 2026-03-08 | Extended the evaluation and metrics layer to expose average rigor, feasibility, fidelity, parsimony, tool-trace volume, invalid bounded-tool rate, paper understanding, and communication quality, then wired those metrics into saved before-vs-after plots plus shared cross-run benchmark history plots. | Notebook and CLI can read the core quality metrics over time, including paper understanding and communication | Yes - verified with `python -m pytest tests/test_training_metrics.py tests/test_training_cli.py` and generated plot artifacts |
| TRN 05 | E08 | Connect rollouts to GRPO or equivalent trainer | `replicalab/training/art_openenv.py`, `replicalab/training/cli.py`, `tests/test_training_cli.py`, `replicalab/outputs/art-training/` | 2026-03-08 | Added the ART/OpenEnv Scientist training path, converting live ReplicaLab episodes plus frozen evidence packs into ART trajectory groups and executing successful live training updates against the hosted environment. | At least one short training run completes without runtime errors while preserving deterministic reward and frozen evidence inputs | Yes - verified with live `art-scientist-train` runs including `art-scientist-smoke-20260308` and `art-scientist-live-20260308-main` |
| TRN 06 | E08 | Log episode reward, rigor, feasibility, fidelity, rounds used, and bounded tool metrics | `replicalab/training/metrics.py`, `replicalab/training/art_openenv.py`, `replicalab/training/cli.py` | 2026-03-08 | Added structured episode metric exports covering reward, component scores, rounds used, agreement, parse errors, invalid actions, and invalid bounded-tool rates to JSONL and summary artifacts. | Notebook stores a metrics frame across training episodes including bounded tool metrics | Yes - verified with `reports/metrics.jsonl` outputs from ART training and comparison runs |
| TRN 07 | E08 | Plot reward curve and component curves with matplotlib | `replicalab/training/plots.py`, `replicalab/training/cli.py`, `replicalab/outputs/art-training/` | 2026-03-08 | Added saved matplotlib plotting for training-history curves, per-step ART reward-component plots, and comparison bar charts for reward, agreement, invalid actions, and invalid bounded-tool rate. | Plotted image shows visible metrics and can be saved to file | Yes - verified with saved images including `art_reward_components.png` and the `compare_*.png` outputs |
| TRN 08 | E08 | Add before versus after evaluation on fixed seeds and frozen evidence packs | `replicalab/training/evaluation.py`, `replicalab/training/cli.py`, `replicalab/agents/scientist_policy.py` | 2026-03-08 | Added policy-comparison evaluation on fixed seeds and frozen evidence packs, then exercised it against the deterministic baseline and trained ART Scientist checkpoints. | Notebook compares baseline and trained policy on the same scenarios and evidence packs | Yes - verified with `scientist-compare-eval` runs including `art-scientist-compare-smoke-20260308` and `art-scientist-compare-20260308-step5` |
| TRN 09 | E08 | Add policy loading path for trained adapter or checkpoint | `replicalab/agents/scientist_policy.py`, `replicalab/agents/__init__.py`, `tests/test_scientist_policy.py` | 2026-03-08 | Added remote trained-policy loading for ART checkpoints, including evidence-pack-aware prompt assembly and parser-driven retry, so evaluation can switch cleanly between baseline and trained Scientist policies. | Evaluation can switch between baseline and trained model cleanly | Yes - verified with live `scientist-compare-eval` runs against explicit ART checkpoint steps |
| TRN 10 | E08 | Export plot image and sample logs to `outputs/plots` | `replicalab/training/cli.py`, `replicalab/outputs/art-training/`, `replicalab/outputs/training/` | 2026-03-08 | Wired the CLI to save training plots, comparison plots, metrics JSONL, summaries, manifests, and run metadata into stable output directories for README and demo reuse. | Plots are saved and versioned for README use | Yes - verified with generated plot and report artifacts under `replicalab/outputs/art-training/` and `replicalab/outputs/training/` |
| TRN 15 | E08 | Add agreement rate, invalid action rate, and invalid bounded-tool rate aggregation to evaluation outputs | `replicalab/training/metrics.py`, `replicalab/training/evaluation.py`, `replicalab/training/cli.py`, `tests/test_training_metrics.py` | 2026-03-08 | Added aggregate agreement, invalid-action, and invalid bounded-tool metrics across evaluation cases, surfaced them in summaries, and plotted them for before-vs-after comparisons. | Notebook reports reward, rounds, agreement rate, invalid action rate, and invalid bounded-tool rate for baseline and trained runs | Yes - verified with comparison summaries and plots from the ART evaluation runs |
| OBS 06 | E10 | Log training run metadata including model, seed, scenario set, steps, evidence-pack version, and bounded-tool policy | `replicalab/training/cli.py`, `replicalab/outputs/art-training/*/reports/run_metadata.json` | 2026-03-08 | Added reproducibility metadata exports for every training and evaluation command, including base model, scenario set, checkpoint step, evidence-pack version, and bounded-tool policy. | Notebook exports metadata with each run for reproducibility including evidence-pack version and bounded-tool policy | Yes - verified with generated `run_metadata.json` files in training and comparison smoke runs |
| TST 09 | E11 | Create notebook smoke test for fresh runtime | `docs/ayush/notebook_smoke_test.md`, `replicalab/outputs/training/`, `replicalab/outputs/art-training/` | 2026-03-08 | Wrote the fresh-runtime smoke checklist and then executed the preview, live ART training, and comparison-eval commands end to end against frozen evidence packs and the hosted ReplicaLab environment. | Training notebook runs from top with minimal edits and the bounded-tool path works against frozen evidence packs | Yes - verified with `scientist-preview-smoke-20260308b`, `lab-manager-preview-smoke-20260308b`, `art-scientist-smoke-20260308b`, and `art-scientist-compare-smoke-20260308b` |
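Several rows above (MOD 09, AGT 03) describe recovering a structured action from raw model text, with explicit parse errors feeding the retry loop. A minimal sketch of that parsing idea, assuming a flat action JSON object; the real `parse_scientist_output` in `replicalab/agents/scientist_policy.py` additionally validates the result into a typed `ScientistAction`:

```python
import json
import re

class ScientistOutputParseError(ValueError):
    """Raised when no valid action JSON can be recovered from model text."""

def parse_scientist_output(text: str) -> dict:
    # Prefer a fenced json code block, then fall back to the widest {...} span.
    candidates = []
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1))
    braced = re.search(r"\{.*\}", text, re.DOTALL)
    if braced:
        candidates.append(braced.group(0))
    for cand in candidates:
        try:
            obj = json.loads(cand)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "action" in obj:
            return obj  # schema validation would happen here
    raise ScientistOutputParseError("no action JSON found in model output")
```

An explicit exception type is what lets the retry layer (AGT 03) send an error-specific correction prompt instead of silently retrying.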
### Kush (Person D) - Completed on behalf of others

| ID | Epic | Assigned To | Task | File/Module | Date | What Was Done | Acceptance Criteria | Verified |
|----|------|------------|------|-------------|------|---------------|--------------------|---------|
| FND 03 | E01 | Max (Person C) | Initialize React plus Vite frontend shell | `frontend/package.json`, `frontend/src/`, `frontend/public/` | 2026-03-08 | Imported the full React plus Vite frontend tree from Kush's branch onto `ayush`, including the app shell, pages, component library, assets, and TypeScript config. | `npm install` and dev server run successfully | Yes - verified with `npm --prefix frontend install` and `npm --prefix frontend run build` |
| FND 12 | E01 | Max (Person C) | Create Vite config with API and WebSocket proxy support plus stable build output settings | `frontend/vite.config.ts` | 2026-03-08 | Imported Kush's Vite configuration with `@` alias plus `/api` and `/ws` proxy rules, then verified the frontend builds successfully against that config on `ayush`. | Frontend dev server can reach backend without manual URL edits and build output is predictable for Docker packaging | Yes - verified with `npm --prefix frontend run build` |
### Shared Tasks - Completed

| ID | Epic | Owners | Task | Status |
|----|------|--------|------|--------|
| FND 08 | E01 | Person A and B | Freeze JSON contract for actions and observations | Completed |
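FND 08 froze the JSON contract for actions and observations. A hypothetical sketch of what a frozen Scientist action could look like — the field names here are assumptions (the authoritative schema lives in `docs/fnd08_frozen_json_contract.md` and `replicalab/models.py`); only the `propose`/`revise`/`accept` verbs are taken from the task rows above:

```python
from dataclasses import dataclass, field

# Action verbs come from the baseline-policy rows above; field names are
# illustrative stand-ins for the frozen schema, not the real contract.
ALLOWED_ACTIONS = {"propose", "revise", "accept"}

@dataclass
class ScientistActionSketch:
    action: str                                   # one of ALLOWED_ACTIONS
    protocol: list = field(default_factory=list)  # protocol steps; empty on "accept"
    message: str = ""                             # free-text note to the Lab Manager

    def __post_init__(self) -> None:
        if self.action not in ALLOWED_ACTIONS:
            raise ValueError(f"invalid action: {self.action!r}")
```

Freezing a contract like this early is what lets the environment, server, frontend, and training client evolve independently against the same payload shape.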
### Max (Person C) - Completed own task

| ID | Epic | Task | Status |
|----|------|------|--------|
| FND 11 | E01 | Create `server/requirements.txt` pinning runtime dependencies | Completed |
### Kush (Person D) - Completed own tasks

| ID | Epic | Task | Status |
|----|------|------|--------|
| FND 13 | E01 | Tailwind v4.2 + theme tokens + light/dark mode | Completed |
| UI 01 | E09 | App shell with three-panel layout | Completed |
| UI 02 | E09 | PaperPanel | Completed |
| UI 03 | E09 | ProtocolPanel with DiffRow | Completed |
| UI 04 | E09 | NegotiationLog with character avatars | Completed |
| UI 05 | E09 | ScorePanel with rigor/feasibility/fidelity bars | Completed |
| UI 06 | E09 | Controls (scenario selector, seed input, difficulty) | Completed |
| UI 07 | E09 | REST + WebSocket API client (api.ts) | Completed |
| UI 08 | E09 | ReplayViewer with range slider | Completed |
| UI 09 | E09 | TrainingResults with LineChart | Completed |
| UI 10 | E09 | Styling, animations, 3D lab scene | Completed |
| UI 11 | E09 | Multi-stage Docker, SPA serving | Completed |
| UI 13 | E09 | JudgeAuditPanel with verdict display | Completed |
| UI 14 | E09 | Replay scrubber with skip buttons | Completed |
| UI 15 | E09 | Before vs after training toggle | Completed |
| JDG 09 | E05 | Mock score cards for frontend | Completed |
| OBS 05 | E10 | Episode ID + copy-to-clipboard in UI | Completed |

---
## What Completing These Tasks Unblocked

| Completed Task | Directly Unblocked |
|---------------|-------------------|
| FND 01 | FND 02, FND 03, FND 04, FND 05, FND 06, FND 07, FND 10 |
| FND 02 | FND 11 |
| FND 03 | FND 12, FND 13, UI 01 |
| FND 04 | FND 08, FND 09 |
| FND 05 | No downstream dependencies |
| FND 06 | DOC 01 |
| FND 07 | No downstream dependencies |
| FND 08 | MOD 01, MOD 02, MOD 03, MOD 12, SCN 01 |
| FND 09 | OpenEnv registration layer is now present for later `/web` and deployment work |
| FND 10 | No downstream dependencies |
| FND 11 | No new formal dependencies, but server scaffold work can now install from a standalone requirements file |
| FND 12 | Frontend dev proxying is now configured for local API and WebSocket work |
| MOD 01 | MOD 05, MOD 09 |
| MOD 02 | No new formal dependencies, but the Lab Manager contract is now stable for later policy work |
| MOD 03 | MOD 04, MOD 11 |
| MOD 04 | MOD 07, ENV 01 |
| MOD 05 | MOD 06, AGT 05 |
| MOD 11 | No new formal dependency edge by itself, but `StepResult` metadata is now stable for environment, API, replay, and training consumers |
| MOD 12 | Shared defaults now come from `replicalab/config.py`, reducing config drift before environment and scoring work expands |
| SCN 01 | SCN 09 now has a deterministic seed utility to build on |
| SCN 02 | SCN 03, SCN 04, SCN 05, SCN 07 |
| SCN 03 | SCN 06, SCN 08 |
| SCN 04 | SCN 06, SCN 08 |
| SCN 05 | SCN 06, SCN 08 |
| SCN 06 | Harder scenario variants and curriculum-ready difficulty scaling now exist |
| SCN 07 | `AGT 05` is complete; `AGT 06`, `AGT 07`, and `JDG 02` are now unblocked from the normalized resource layer |
| SCN 08 | `AGT 06` is now unblocked; `JDG 01` and `JDG 03` are also unblocked |
| SCN 09 | SCN 10, SCN 11, ENV 01, ENV 02 |
| SCN 10 | Scenario determinism and consistency now have regression coverage |
| SCN 11 | AGT 01, TRN 08 |
| MOD 09 | Together with completed `AGT 02`, `AGT 03` is now unblocked |
| AGT 01 | AGT 02, AGT 11, TRN 04 |
| AGT 02 | AGT 03, AGT 04 |
| AGT 04 | Removes the last baseline-policy blocker; `AGT 08` now only waits on `AGT 03` |
| AGT 05 | AGT 06, AGT 07, JDG 02 |
| AGT 06 | No new formal dependency edge by itself, but `AGT 07` now has deterministic revision content to narrate and compare against |
| AGT 07 | `AGT 10` is now unblocked, and the stub server now emits grounded Lab Manager responses instead of placeholder review text |
| AGT 10 | Prompt templates now exist for all three roles with bounded tool rules and normalized scenario rendering, reducing prompt drift between notebooks, demos, and future model calls |
| AGT 11 | No new formal dependency edge by itself, but the Scientist training model choice is now fixed across repo docs |
| ENV 01 | ENV 02, ENV 08, and the real-environment import path that partial server tasks now depend on |
| JDG 01 | Together with JDG 02 and JDG 03, unblocks JDG 04 (total reward formula) |
| JDG 02 | Together with JDG 01 and JDG 03, unblocks JDG 04 (total reward formula) |
| JDG 03 | Together with JDG 01 and JDG 02, unblocks JDG 04 (total reward formula) |
| JDG 04 | JDG 05, JDG 08, TST 04, TST 05 |
| JDG 05 | JDG 06, JDG 07, JDG 09, JDG 10, JDG 11, ENV 06 |
| JDG 06 | AGT 10, JDG 11 |
| ENV 02 | ENV 03, ENV 07, ENV 10, TST 01, API 02 (partial → full) |
| ENV 03 | ENV 04, ENV 05, TST 02, TST 03 |
| ENV 04 | ENV 05, TST 02 |
| ENV 05 | ENV 06, TST 02 |
| ENV 06 | ENV 07, ENV 09, ENV 11, API 03 (partial → full), API 06 (partial → full), OBS 07 |
| API 06 | TRN 03, TRN 13 |
| API 09 | API 10, API 17 |
| TST 07 | No new dependencies |
| ENV 07 | ENV 10 (partial unblock) |
| ENV 08 | API 07 (partial → full) |
| TST 01 | No new dependencies |
| TST 02 | No new dependencies |
| TST 03 | No new dependencies |
| API 02 | API 14, UI 06 |
| TRN 13 | TRN 03 now has both its dependencies met (API 06 + TRN 13) |
| TRN 03 | TRN 01 (Colab notebook skeleton), TRN 04 (reward shaping for GRPO) |
| TRN 04 | TRN 05 (trainer integration) and partial unblock for TRN 06 (metrics logging once JDG 10 exists) |
| API 08 | API 09, API 16, API 19 |
| MOD 06 | Partial unblock for MOD 08 (unit tests for schemas and validators, depends on MOD 01–07) |
| MOD 07 | MOD 08, JDG 07 |
| MOD 10 | Frontend and notebook consumers now share canonical schema examples generated from the current contracts |
| SCN 13 | No new formal dependency edge by itself, but deterministic booking and scheduling conflicts are now present in the normalized scenario pack for later environment, judge, and UI work |
| AGT 09 | No new formal dependency edge by itself, but the grounded Lab Manager checker/suggestion/response stack now has deterministic regression coverage |
| JDG 11 | ENV 11 (attach audit to terminal StepResult), UI 13 (render audit in frontend), OBS 09 (extend episode summary with audit) |
| ENV 11 | No new fully unblocked tasks by itself; `API 18` and `OBS 09` are each one dependency closer because the audit payload now survives into replay-facing state |
| API 10 | TRN 01 (Colab notebook skeleton), TRN 11 (environment URL documentation) |
| API 17 | No new formal dependency edge by itself, but secrets landscape is now documented for all contexts |
| ENV 09 | OBS 01, API 05 |
| OBS 01 | OBS 03, OBS 07 |
| OBS 03 | No downstream dependencies beyond OBS 07 which is also complete |
| OBS 07 | No downstream dependencies |
| OBS 09 | TRN 15 is one dependency closer (still needs TRN 06 and TRN 08) |
| API 05 | UI 08, OBS 05 |
| API 11 | No downstream dependencies |
| API 18 | TST 11, UI 13 |
| TST 06 | No downstream dependencies |
| TST 11 | No downstream dependencies |
### Current Unblocked and Active Tasks

All 152 tasks are complete. No tasks remain.

---

## Epic Progress

| Epic | Total Tasks | Completed | Rate |
|------|------------|-----------|------|
| E01. Foundations and repository setup | 13 | 13 | 100.00% |
| E02. Domain models, validation, state contracts | 12 | 12 | 100.00% |
| E03. Scenario engine and constraint generation | 13 | 13 | 100.00% |
| E04. Scientist agent and Lab Manager policy | 11 | 11 | 100.00% |
| E05. Judge engine and reward logic | 11 | 11 | 100.00% |
| E06. OpenEnv environment implementation | 11 | 11 | 100.00% |
| E07. API, server, Docker, deployment | 19 | 19 | 100.00% |
| E08. RL training pipeline and evaluation | 15 | 15 | 100.00% |
| E09. Frontend, UX, replay, demo views | 15 | 15 | 100.00% |
| E10. Logging, replay, and observability | 9 | 9 | 100.00% |
| E11. Testing and quality gates | 12 | 12 | 100.00% |
| E12. README, demo video, submission packaging | 11 | 11 | 100.00% |
docs/demo_script.md
ADDED
@@ -0,0 +1,74 @@
# ReplicaLab -- One-Minute Demo Script

Total duration: **60 seconds**

---

## Scene 1: Hook (0:00 -- 0:08)

**Visual**: Dashboard landing page with 3D molecule background and three animated characters.

**Narration / Caption**:
> "Most ML papers can't be reproduced. ReplicaLab trains an AI agent to negotiate realistic replication plans -- under real constraints."

---

## Scene 2: The Cast (0:08 -- 0:16)

**Visual**: Scroll down to the "Meet the Cast" section. Hover over each tilt card to show the 3D effect.

**Narration / Caption**:
> "Three roles: Dr. Elara proposes plans. Takuma enforces GPU budgets, schedules, and resource limits. Aldric judges the result."

---

## Scene 3: Start an Episode (0:16 -- 0:24)

**Visual**: Click "Run Episode". Select ML Benchmark, Medium difficulty. Click "Start Episode".

**Narration / Caption**:
> "Each episode generates a seeded scenario. Here: replicate a ViT fine-tuning result with a limited GPU budget."

---

## Scene 4: Negotiation (0:24 -- 0:38)

**Visual**: Show the CharacterStage with the Scientist and Lab Manager animated. Scroll through the negotiation log showing the proposal, feasibility report, and revised protocol.

**Narration / Caption**:
> "The Scientist proposes 5 seeds on A100s. The Lab Manager flags the budget overshoot. The Scientist revises down to 3 seeds -- staying within budget while keeping A100s for compute fidelity."

---

## Scene 5: Judge Verdict (0:38 -- 0:48)

**Visual**: Click "Step". Show the Judge appearing center-stage with a gavel sound. The score card reveals a total reward of 8.12 with the R/F/D breakdown.

**Narration / Caption**:
> "Judge Aldric scores the plan: 85% rigor, 93% feasibility, 80% fidelity. Total reward: 8.12 out of 10. The multiplicative formula means every dimension matters."

---

## Scene 6: Training Results (0:48 -- 0:56)

**Visual**: Show the Training Results panel with the before/after toggle. Click the toggle to show baseline vs. trained curves.

**Narration / Caption**:
> "After RL training with GRPO, the Scientist improves: 67% higher reward, 32% fewer rounds, and the invalid action rate drops from 15% to 4%."

---

## Scene 7: Close (0:56 -- 1:00)

**Visual**: Return to the dashboard hero with all three characters. Show the HF Space URL.

**Narration / Caption**:
> "ReplicaLab. An OpenEnv world where agents learn to negotiate science."

---

## Backup Notes

- **Pre-tested seed**: Use seed `42` with `ml_benchmark` / `medium` for a reliable demo.
- **Fallback**: If the custom UI fails, navigate to `/web` on the HF Space for the OpenEnv built-in interface.
- **Audio**: The app has built-in sound effects. Keep speakers on for a richer demo, or mute if presenting in a noisy venue.
docs/demo_video_script_60s.md
ADDED
@@ -0,0 +1,13 @@
# ReplicaLab 60s Demo Script

## Voiceover

ReplicaLab starts from a research paper and turns it into a seeded replication benchmark. The Scientist proposes a protocol, the Lab Manager enforces budget, tools, and scheduling, and a deterministic Judge scores rigor, feasibility, and fidelity. In our first scenario, the agents agree immediately, so the paper looks replicable in this lab. In the second scenario, they negotiate across all six rounds, which creates a rich reinforcement learning signal. In the third, they never resolve the blockers, so the system rejects the paper for the current setup. Because every outcome is scored deterministically, we can train the Scientist with Unsloth and TRL, compare baseline versus trained runs, inspect real logs, and see exactly where more learning is still needed. The training page is intentionally honest: the live run reached positive rewards, but the held-out comparison still shows that the trained Scientist has not yet beaten the deterministic baseline.

## Shot List

1. Dashboard hero: introduce ReplicaLab and the paper-to-training loop.
2. First-round agreement: show a clean acceptance and a high replicability score.
3. Multi-round learning: show the six-round negotiation and the learning-opportunity results panel.
4. No agreement: show the timeout / rejection outcome and the low reliability signal.
5. Training page: show artifact-backed logs, checkpoints, the baseline-vs-trained comparison, and the explicit note that more training is still required.
docs/fnd08_frozen_json_contract.md
ADDED
@@ -0,0 +1,519 @@
# FND 08 Frozen JSON Contract

Status: completed on 2026-03-08
Owners: Person A and Person B
Drafted by: Person B (Ayush)
Remaining acceptance item: none

Source schema file: `replicalab/models.py`

## Purpose

This document freezes the JSON contract for the shared ReplicaLab data models so downstream work can proceed without schema drift. It is the reference for:

- Person A validators and environment state handling
- Person B prompt formatting and action parsing
- Person C API payload examples
- Person D frontend and replay mocks

## Tool-Capability Addendum

The richer-capability MVP adds bounded search, code-check, and image-inspection support below this frozen contract.

This addendum does **not** reopen the outward action schema from `FND 08`. The final outward actions remain `ScientistAction` and `LabManagerAction`. Bounded tool use will be represented through scenario or evidence metadata, environment-side tool traces, and `StepResult.info` or replay payloads rather than new outward action types for the MVP.

## Global conventions

- All JSON keys use `snake_case`.
- Enum-like values use lowercase snake_case strings.
- All top-level keys listed in this document must be present unless explicitly marked nullable.
- Use `null` for an absent single object.
- Use `[]` for a known empty collection.
- Use `{}` only for flexible metadata objects such as `info` and `reward_breakdown`.
- `round_number` is zero-based. `0` is the state immediately after `reset()`.
- `duration_days` and `time_limit_days` are whole calendar days.
- `difficulty` values are `easy`, `medium`, or `hard`.
- Component scores such as rigor, feasibility, and fidelity are floats in the inclusive range `0.0` to `1.0`.
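The conventions above can be spot-checked mechanically. The sketch below is illustrative only: the helper name `check_conventions` and the `SCORE_KEYS` set are assumptions made here, not part of `replicalab/models.py`.

```python
"""Minimal convention checks for FND 08 payloads (illustrative sketch)."""
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")
SCORE_KEYS = {"rigor", "feasibility", "fidelity"}  # component scores in [0.0, 1.0]


def check_conventions(payload: dict) -> list[str]:
    """Return a list of convention violations for a flat JSON payload."""
    errors = []
    for key, value in payload.items():
        # All JSON keys must be snake_case per the global conventions.
        if not SNAKE_CASE.match(key):
            errors.append(f"key {key!r} is not snake_case")
        # Component scores are floats in the inclusive range 0.0 to 1.0.
        if key in SCORE_KEYS and not 0.0 <= float(value) <= 1.0:
            errors.append(f"score {key!r}={value} outside [0.0, 1.0]")
    # round_number is zero-based, so it can never be negative.
    if "round_number" in payload and payload["round_number"] < 0:
        errors.append("round_number must be zero-based and non-negative")
    return errors
```

A payload that follows the conventions yields an empty list; each violation adds one message.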
## Shared nested objects

### ConversationEntry

Each item in `conversation_history` or `transcript` must use this shape:

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `role` | `str` | yes | One of `scientist`, `lab_manager`, `system` |
| `message` | `str` | yes | Human-readable turn text |
| `round_number` | `int` | yes | Zero-based round index for the message |
| `action_type` | `str \| null` | yes | Mirrors the action type when the message comes from an agent, otherwise `null` |

### Protocol

When `current_protocol` is not `null`, it must use this shape:

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `sample_size` | `int` | yes | Non-negative integer |
| `controls` | `list[str]` | yes | Empty list when no controls are specified yet |
| `technique` | `str` | yes | Proposed experimental technique |
| `duration_days` | `int` | yes | Whole calendar days |
| `required_equipment` | `list[str]` | yes | Empty list when none is needed |
| `required_reagents` | `list[str]` | yes | Empty list when none is needed |
| `rationale` | `str` | yes | Short explanation for the protocol |

### RewardBreakdown

When `reward_breakdown` is present, it must use this shape:

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `rigor` | `float` | yes | Component score in `0.0` to `1.0` |
| `feasibility` | `float` | yes | Component score in `0.0` to `1.0` |
| `fidelity` | `float` | yes | Component score in `0.0` to `1.0` |
| `efficiency_bonus` | `float` | yes | Bonus term, `0.0` if unused |
| `communication_bonus` | `float` | yes | Bonus term, `0.0` if unused |
| `penalties` | `dict[str, float]` | yes | Per-penalty values keyed by penalty name |

## Model contracts

### ScientistAction

Action types:

- `propose_protocol`
- `revise_protocol`
- `request_info`
- `accept`

Field contract:

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `action_type` | `str` | yes | Must be one of the values above |
| `sample_size` | `int` | yes | Meaningful for `propose_protocol` and `revise_protocol`, otherwise `0` |
| `controls` | `list[str]` | yes | Meaningful for `propose_protocol` and `revise_protocol`, otherwise `[]` |
| `technique` | `str` | yes | Meaningful for `propose_protocol` and `revise_protocol`, otherwise `""` |
| `duration_days` | `int` | yes | Meaningful for `propose_protocol` and `revise_protocol`, otherwise `0` |
| `required_equipment` | `list[str]` | yes | Meaningful for `propose_protocol` and `revise_protocol`, otherwise `[]` |
| `required_reagents` | `list[str]` | yes | Meaningful for `propose_protocol` and `revise_protocol`, otherwise `[]` |
| `questions` | `list[str]` | yes | Meaningful for `request_info`, otherwise `[]` |
| `rationale` | `str` | yes | Required free-text explanation for protocol proposals and revisions; `""` for `accept` |

Canonical example:

```json
{
  "action_type": "propose_protocol",
  "sample_size": 48,
  "controls": ["vehicle_control", "positive_control"],
  "technique": "wst1_assay",
  "duration_days": 5,
  "required_equipment": ["plate_reader", "co2_incubator"],
  "required_reagents": ["wst1", "dmso", "drug_x"],
  "questions": [],
  "rationale": "Keeps the core readout while using equipment commonly available in teaching labs."
}
```
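A consumer can fill in the per-action-type defaults from the table above before handing a payload to the real validators. The sketch below is illustrative: `normalize_scientist_action` and `DEFAULTS` are names invented here, not exports of `replicalab/models.py`.

```python
"""Sketch: fill contract defaults for a ScientistAction payload."""
SCIENTIST_ACTION_TYPES = {"propose_protocol", "revise_protocol", "request_info", "accept"}

# Default values for fields that are "not meaningful" for a given action type.
DEFAULTS = {
    "sample_size": 0,
    "controls": [],
    "technique": "",
    "duration_days": 0,
    "required_equipment": [],
    "required_reagents": [],
    "questions": [],
    "rationale": "",
}


def normalize_scientist_action(raw: dict) -> dict:
    """Return a full ScientistAction dict with every contract key present."""
    action_type = raw.get("action_type")
    if action_type not in SCIENTIST_ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    out = {"action_type": action_type}
    for key, default in DEFAULTS.items():
        value = raw.get(key, default)
        # Copy lists so callers cannot mutate the shared defaults.
        out[key] = list(value) if isinstance(default, list) else value
    return out
```

For example, a bare `request_info` payload comes back with `sample_size` of `0`, empty lists, and an empty `rationale`, matching the "otherwise" column of the field contract.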
### LabManagerAction

Action types:

- `report_feasibility`
- `suggest_alternative`
- `reject`
- `accept`

Field contract:

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `action_type` | `str` | yes | Must be one of the values above |
| `feasible` | `bool` | yes | Overall summary flag equal to the logical AND of the constraint dimension flags |
| `budget_ok` | `bool` | yes | Whether the proposed protocol fits the remaining budget |
| `equipment_ok` | `bool` | yes | Whether required equipment is available in time |
| `reagents_ok` | `bool` | yes | Whether required reagents are available |
| `schedule_ok` | `bool` | yes | Whether the protocol fits the allowed timeline |
| `staff_ok` | `bool` | yes | Whether staffing is sufficient |
| `suggested_technique` | `str` | yes | Meaningful for `suggest_alternative`, otherwise `""` |
| `suggested_sample_size` | `int` | yes | Meaningful for `suggest_alternative`, otherwise `0` |
| `suggested_controls` | `list[str]` | yes | Meaningful for `suggest_alternative`, otherwise `[]` |
| `explanation` | `str` | yes | Human-readable explanation of the constraint outcome |

Conditional rules:

- `action_type = accept` implies `feasible = true` and all constraint flags are `true`.
- `action_type = reject` implies `feasible = false` and at least one constraint flag is `false`.
- `action_type = suggest_alternative` implies `feasible = false` and at least one of the suggestion fields carries a non-default value.

Canonical example:

```json
{
  "action_type": "suggest_alternative",
  "feasible": false,
  "budget_ok": true,
  "equipment_ok": false,
  "reagents_ok": true,
  "schedule_ok": true,
  "staff_ok": true,
  "suggested_technique": "manual_cell_counting",
  "suggested_sample_size": 32,
  "suggested_controls": ["vehicle_control", "positive_control"],
  "explanation": "The plate reader is fully booked, so use manual counting and reduce the sample size to stay within the timeline."
}
```
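The conditional rules above are purely mechanical, so they can be expressed as a checker. This is an illustrative sketch only; `check_lab_manager_rules` is a name invented here, not part of the shipped validators.

```python
"""Sketch: enforce the LabManagerAction conditional rules."""
FLAG_KEYS = ("budget_ok", "equipment_ok", "reagents_ok", "schedule_ok", "staff_ok")


def check_lab_manager_rules(action: dict) -> list[str]:
    """Return violations of the accept / reject / suggest_alternative rules."""
    errors = []
    flags = [action[k] for k in FLAG_KEYS]
    # `feasible` must equal the logical AND of the five constraint flags.
    if action["feasible"] != all(flags):
        errors.append("feasible must equal the AND of the constraint flags")
    kind = action["action_type"]
    if kind == "accept" and not all(flags):
        errors.append("accept requires every constraint flag to be true")
    if kind == "reject" and all(flags):
        errors.append("reject requires at least one false constraint flag")
    if kind == "suggest_alternative":
        has_suggestion = (
            action["suggested_technique"] != ""
            or action["suggested_sample_size"] != 0
            or action["suggested_controls"] != []
        )
        if not has_suggestion:
            errors.append("suggest_alternative needs a non-default suggestion field")
    return errors
```

The canonical `suggest_alternative` example above passes with no violations; flipping its `action_type` to `accept` while `equipment_ok` stays `false` produces two.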
### ScientistObservation

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `paper_title` | `str` | yes | Study title |
| `paper_hypothesis` | `str` | yes | Core hypothesis being replicated |
| `paper_method` | `str` | yes | Short method summary |
| `paper_key_finding` | `str` | yes | Main finding being targeted |
| `experiment_goal` | `str` | yes | What the scientist is trying to preserve |
| `conversation_history` | `list[ConversationEntry]` | yes | Empty list at reset |
| `current_protocol` | `Protocol \| null` | yes | `null` until a protocol exists |
| `round_number` | `int` | yes | Zero-based current round |
| `max_rounds` | `int` | yes | Maximum allowed rounds in the episode |

Canonical example:

```json
{
  "paper_title": "Drug X reduces glioblastoma cell viability",
  "paper_hypothesis": "Drug X reduces viability in a dose-dependent manner.",
  "paper_method": "96-well viability assay with 24h incubation and absorbance readout.",
  "paper_key_finding": "The highest dose reduced viability by about 40 percent.",
  "experiment_goal": "Replicate the dose-response trend without dropping essential controls.",
  "conversation_history": [],
  "current_protocol": null,
  "round_number": 0,
  "max_rounds": 6
}
```
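Person B's prompt formatting consumes this shape. The sketch below shows one plausible rendering; the layout and the helper name `render_scientist_prompt` are invented here for illustration, since the real prompt templates belong to the agent code, not this contract.

```python
"""Sketch: render a ScientistObservation dict as prompt text."""


def render_scientist_prompt(obs: dict) -> str:
    """Flatten the observation into plain lines a language model can read."""
    lines = [
        f"Paper: {obs['paper_title']}",
        f"Hypothesis: {obs['paper_hypothesis']}",
        f"Goal: {obs['experiment_goal']}",
        f"Round {obs['round_number']} of {obs['max_rounds']}",
    ]
    # `current_protocol` is null until a protocol exists, per the contract.
    if obs["current_protocol"] is None:
        lines.append("No protocol proposed yet.")
    # Conversation history is empty at reset and grows each round.
    for entry in obs["conversation_history"]:
        lines.append(f"[{entry['role']}] {entry['message']}")
    return "\n".join(lines)
```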
### LabManagerObservation

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `budget_total` | `float` | yes | Initial budget for the episode |
| `budget_remaining` | `float` | yes | Current remaining budget |
| `equipment_available` | `list[str]` | yes | Equipment that can be used |
| `equipment_booked` | `list[str]` | yes | Equipment unavailable due to booking |
| `reagents_in_stock` | `list[str]` | yes | Available reagents |
| `reagents_out_of_stock` | `list[str]` | yes | Required but unavailable reagents |
| `staff_count` | `int` | yes | Available staff count |
| `time_limit_days` | `int` | yes | Whole calendar days remaining |
| `safety_restrictions` | `list[str]` | yes | Constraints such as banned solvents or assay restrictions |
| `conversation_history` | `list[ConversationEntry]` | yes | Empty list at reset |
| `current_protocol` | `Protocol \| null` | yes | `null` until a protocol exists |
| `round_number` | `int` | yes | Zero-based current round |
| `max_rounds` | `int` | yes | Maximum allowed rounds in the episode |

Canonical example:

```json
{
  "budget_total": 1200.0,
  "budget_remaining": 1200.0,
  "equipment_available": ["co2_incubator", "microscope"],
  "equipment_booked": ["plate_reader"],
  "reagents_in_stock": ["dmso", "drug_x", "culture_media"],
  "reagents_out_of_stock": ["wst1"],
  "staff_count": 2,
  "time_limit_days": 7,
  "safety_restrictions": ["no_radioactive_reagents"],
  "conversation_history": [],
  "current_protocol": null,
  "round_number": 0,
  "max_rounds": 6
}
```
### Observation

Wrapper behavior:

- Serialized `Observation` objects always include both top-level keys: `scientist` and `lab_manager`.
- In shared environment state, replay, and API payloads, both branches should normally be populated.
- When a consumer is intentionally given only one role view, the non-owned branch must be `null`, not omitted.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `scientist` | `ScientistObservation \| null` | yes | Scientist-side view |
| `lab_manager` | `LabManagerObservation \| null` | yes | Lab-manager-side view |

Canonical example:

```json
{
  "scientist": {
    "paper_title": "Drug X reduces glioblastoma cell viability",
    "paper_hypothesis": "Drug X reduces viability in a dose-dependent manner.",
    "paper_method": "96-well viability assay with 24h incubation and absorbance readout.",
    "paper_key_finding": "The highest dose reduced viability by about 40 percent.",
    "experiment_goal": "Replicate the dose-response trend without dropping essential controls.",
    "conversation_history": [],
    "current_protocol": null,
    "round_number": 0,
    "max_rounds": 6
  },
  "lab_manager": {
    "budget_total": 1200.0,
    "budget_remaining": 1200.0,
    "equipment_available": ["co2_incubator", "microscope"],
    "equipment_booked": ["plate_reader"],
    "reagents_in_stock": ["dmso", "drug_x", "culture_media"],
    "reagents_out_of_stock": ["wst1"],
    "staff_count": 2,
    "time_limit_days": 7,
    "safety_restrictions": ["no_radioactive_reagents"],
    "conversation_history": [],
    "current_protocol": null,
    "round_number": 0,
    "max_rounds": 6
  }
}
```
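Producing a single-role view follows directly from the wrapper rules: keep both keys, set the non-owned branch to `null` (`None` in Python), never drop it. The helper name `role_view` below is invented for illustration.

```python
"""Sketch: mask an Observation down to one role's view."""


def role_view(observation: dict, role: str) -> dict:
    """Return an Observation dict with only the requested branch populated.

    Both top-level keys are always present; the non-owned branch is None,
    never omitted, per the wrapper rules above.
    """
    if role not in ("scientist", "lab_manager"):
        raise ValueError(f"unknown role: {role!r}")
    return {
        "scientist": observation["scientist"] if role == "scientist" else None,
        "lab_manager": observation["lab_manager"] if role == "lab_manager" else None,
    }
```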
### StepResult

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `observation` | `Observation \| null` | yes | Present on normal steps and terminal steps; may be `null` only on hard failure |
| `reward` | `float` | yes | Episode reward after the step; terminal reward on final step |
| `done` | `bool` | yes | Whether the episode is terminal |
| `info` | `dict` | yes | Flexible metadata object |

Reserved `info` keys:

- `agreement_reached`: `bool`
- `error`: `str | null`
- `reward_breakdown`: `RewardBreakdown | null`
- `judge_notes`: `str | null`
- `verdict`: `str | null`

Canonical example:

```json
{
  "observation": {
    "scientist": null,
    "lab_manager": null
  },
  "reward": 6.72,
  "done": true,
  "info": {
    "agreement_reached": true,
    "error": null,
    "reward_breakdown": {
      "rigor": 0.9,
      "feasibility": 0.8,
      "fidelity": 0.85,
      "efficiency_bonus": 0.25,
      "communication_bonus": 0.15,
      "penalties": {
        "invalid_action": 0.0,
        "timeout": 0.0
      }
    },
    "judge_notes": "Controls were preserved and the substitutions remained scientifically acceptable.",
    "verdict": "accept"
  }
}
```
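Because the reserved `info` keys may be `null`, consumers should read them defensively. The sketch below shows one way; the helper name `summarize_step` and the output format are invented here for illustration.

```python
"""Sketch: read the reserved StepResult.info keys defensively."""


def summarize_step(step: dict) -> str:
    """Produce a one-line summary of a StepResult dict."""
    info = step.get("info", {})
    if not step["done"]:
        return f"round in progress, reward so far {step['reward']:.2f}"
    # Reserved keys are nullable, so fall back rather than assume presence.
    verdict = info.get("verdict") or "unknown"
    agreed = info.get("agreement_reached", False)
    breakdown = info.get("reward_breakdown")
    parts = [f"verdict={verdict}", f"agreement={agreed}", f"reward={step['reward']:.2f}"]
    if breakdown is not None:
        parts.append(
            "rigor/feasibility/fidelity="
            f"{breakdown['rigor']:.2f}/{breakdown['feasibility']:.2f}/{breakdown['fidelity']:.2f}"
        )
    return ", ".join(parts)
```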
### EpisodeState

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `seed` | `int` | yes | Deterministic episode seed |
| `scenario_template` | `str` | yes | Scenario family identifier |
| `difficulty` | `str` | yes | `easy`, `medium`, or `hard` |
| `paper_title` | `str` | yes | Study title |
| `paper_hypothesis` | `str` | yes | Core hypothesis |
| `paper_method` | `str` | yes | Method summary |
| `paper_key_finding` | `str` | yes | Main finding |
| `experiment_goal` | `str` | yes | Goal preserved through negotiation |
| `lab_budget_total` | `float` | yes | Initial budget |
| `lab_budget_remaining` | `float` | yes | Remaining budget |
| `lab_equipment` | `list[str]` | yes | Equipment state |
| `lab_reagents` | `list[str]` | yes | Reagent state |
| `lab_staff_count` | `int` | yes | Available staff count |
| `lab_time_limit_days` | `int` | yes | Whole calendar days remaining |
| `current_protocol` | `Protocol \| null` | yes | Current agreed or latest proposed protocol |
| `conversation_history` | `list[ConversationEntry]` | yes | Negotiation history |
| `round_number` | `int` | yes | Zero-based round counter |
| `max_rounds` | `int` | yes | Maximum rounds allowed |
| `done` | `bool` | yes | Terminal flag |
| `agreement_reached` | `bool` | yes | Whether both sides reached agreement |
| `reward` | `float` | yes | Final total reward or `0.0` until terminal scoring |
| `rigor_score` | `float` | yes | Final component score or `0.0` until terminal scoring |
| `feasibility_score` | `float` | yes | Final component score or `0.0` until terminal scoring |
| `fidelity_score` | `float` | yes | Final component score or `0.0` until terminal scoring |

Canonical example:

```json
{
  "seed": 17,
  "scenario_template": "cell_biology",
  "difficulty": "medium",
  "paper_title": "Drug X reduces glioblastoma cell viability",
  "paper_hypothesis": "Drug X reduces viability in a dose-dependent manner.",
  "paper_method": "96-well viability assay with 24h incubation and absorbance readout.",
  "paper_key_finding": "The highest dose reduced viability by about 40 percent.",
  "experiment_goal": "Replicate the dose-response trend without dropping essential controls.",
  "lab_budget_total": 1200.0,
  "lab_budget_remaining": 850.0,
  "lab_equipment": ["co2_incubator", "microscope"],
  "lab_reagents": ["dmso", "drug_x", "culture_media"],
  "lab_staff_count": 2,
  "lab_time_limit_days": 7,
  "current_protocol": {
    "sample_size": 32,
    "controls": ["vehicle_control", "positive_control"],
    "technique": "manual_cell_counting",
    "duration_days": 5,
    "required_equipment": ["microscope", "co2_incubator"],
    "required_reagents": ["dmso", "drug_x", "culture_media"],
    "rationale": "Uses available equipment while preserving control structure."
  },
  "conversation_history": [
    {
      "role": "scientist",
      "message": "I propose a manual counting protocol that keeps both controls.",
      "round_number": 0,
      "action_type": "propose_protocol"
    }
  ],
  "round_number": 1,
  "max_rounds": 6,
  "done": false,
  "agreement_reached": false,
  "reward": 0.0,
  "rigor_score": 0.0,
  "feasibility_score": 0.0,
  "fidelity_score": 0.0
}
```
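The scoring fields above carry a small invariant worth checking in tests: on a non-terminal state they must all still be `0.0`. The sketch below is illustrative only; `check_scoring_invariant` is a name invented here.

```python
"""Sketch: verify the pre-terminal scoring invariant for EpisodeState."""
SCORE_FIELDS = ("reward", "rigor_score", "feasibility_score", "fidelity_score")


def check_scoring_invariant(state: dict) -> list[str]:
    """Return the names of score fields that are non-zero before terminal scoring."""
    if state["done"]:
        # Terminal states may carry any final scores.
        return []
    return [field for field in SCORE_FIELDS if state[field] != 0.0]
```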
### EpisodeLog

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `episode_id` | `str` | yes | Stable replay identifier |
| `seed` | `int` | yes | Episode seed |
| `scenario_template` | `str` | yes | Scenario family identifier |
| `difficulty` | `str` | yes | `easy`, `medium`, or `hard` |
| `final_state` | `EpisodeState \| null` | yes | Must be populated for completed episodes |
| `transcript` | `list[ConversationEntry]` | yes | Replayable transcript |
| `reward_breakdown` | `RewardBreakdown` | yes | Final reward components |
| `total_reward` | `float` | yes | Final total reward |
| `rounds_used` | `int` | yes | Number of completed rounds |
| `agreement_reached` | `bool` | yes | Final agreement flag |
| `judge_notes` | `str` | yes | Human-readable audit summary |
| `verdict` | `str` | yes | One of `accept`, `revise`, `reject` |

Canonical example:

```json
{
  "episode_id": "cell_biology-17-medium-0001",
  "seed": 17,
  "scenario_template": "cell_biology",
  "difficulty": "medium",
  "final_state": {
    "seed": 17,
    "scenario_template": "cell_biology",
    "difficulty": "medium",
    "paper_title": "Drug X reduces glioblastoma cell viability",
    "paper_hypothesis": "Drug X reduces viability in a dose-dependent manner.",
    "paper_method": "96-well viability assay with 24h incubation and absorbance readout.",
    "paper_key_finding": "The highest dose reduced viability by about 40 percent.",
    "experiment_goal": "Replicate the dose-response trend without dropping essential controls.",
    "lab_budget_total": 1200.0,
    "lab_budget_remaining": 850.0,
    "lab_equipment": ["co2_incubator", "microscope"],
    "lab_reagents": ["dmso", "drug_x", "culture_media"],
    "lab_staff_count": 2,
    "lab_time_limit_days": 7,
    "current_protocol": {
      "sample_size": 32,
      "controls": ["vehicle_control", "positive_control"],
      "technique": "manual_cell_counting",
      "duration_days": 5,
      "required_equipment": ["microscope", "co2_incubator"],
      "required_reagents": ["dmso", "drug_x", "culture_media"],
      "rationale": "Uses available equipment while preserving control structure."
    },
    "conversation_history": [
      {
        "role": "scientist",
        "message": "I propose a manual counting protocol that keeps both controls.",
        "round_number": 0,
        "action_type": "propose_protocol"
      },
      {
        "role": "lab_manager",
        "message": "This alternative is feasible with current equipment and budget.",
        "round_number": 0,
        "action_type": "accept"
      }
    ],
    "round_number": 1,
    "max_rounds": 6,
    "done": true,
    "agreement_reached": true,
    "reward": 6.72,
    "rigor_score": 0.9,
    "feasibility_score": 0.8,
    "fidelity_score": 0.85
  },
  "transcript": [
    {
      "role": "scientist",
      "message": "I propose a manual counting protocol that keeps both controls.",
      "round_number": 0,
      "action_type": "propose_protocol"
    },
    {
      "role": "lab_manager",
      "message": "This alternative is feasible with current equipment and budget.",
      "round_number": 0,
      "action_type": "accept"
    }
  ],
| 495 |
+
"reward_breakdown": {
|
| 496 |
+
"rigor": 0.9,
|
| 497 |
+
"feasibility": 0.8,
|
| 498 |
+
"fidelity": 0.85,
|
| 499 |
+
"efficiency_bonus": 0.25,
|
| 500 |
+
"communication_bonus": 0.15,
|
| 501 |
+
"penalties": {
|
| 502 |
+
"invalid_action": 0.0,
|
| 503 |
+
"timeout": 0.0
|
| 504 |
+
}
|
| 505 |
+
},
|
| 506 |
+
"total_reward": 6.72,
|
| 507 |
+
"rounds_used": 1,
|
| 508 |
+
"agreement_reached": true,
|
| 509 |
+
"judge_notes": "Controls were preserved and the substitutions remained scientifically acceptable.",
|
| 510 |
+
"verdict": "accept"
|
| 511 |
+
}
|
| 512 |
+
```
|
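As a sanity aid for downstream parsers, the required-field table above can be turned into a minimal structural check. The helper below is an illustrative sketch, not part of the frozen contract:

```python
# Hypothetical helper, not part of the frozen contract: a quick structural
# check that a parsed EpisodeLog dict carries every required top-level
# field from the table above and an allowed verdict value.
REQUIRED_FIELDS = [
    "episode_id", "seed", "scenario_template", "difficulty", "final_state",
    "transcript", "reward_breakdown", "total_reward", "rounds_used",
    "agreement_reached", "judge_notes", "verdict",
]
ALLOWED_VERDICTS = {"accept", "revise", "reject"}

def check_episode_log(log: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the log passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in log]
    if log.get("verdict") not in ALLOWED_VERDICTS:
        problems.append(f"invalid verdict: {log.get('verdict')!r}")
    return problems

minimal = {name: None for name in REQUIRED_FIELDS}
minimal["verdict"] = "accept"
print(check_episode_log(minimal))  # []
```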

## Sign-off

| Owner | Status | Notes |
| --- | --- | --- |
| Person B (Ayush) | signed off | Draft matches current stubs and downstream parser needs |
| Kian (Person A) | signed off | Validator and environment-owner review completed; contract is frozen for `MOD 01`, `MOD 03`, `FND 09`, and downstream parser work |
docs/future_improvements.md
ADDED
|
@@ -0,0 +1,304 @@
# Future Improvements

Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`

This document tracks post-MVP architectural improvements. Work here begins only after the core logic is complete and the hackathon deliverables are stable.

---

## 1. Domain-Agnostic Normalized Scenario Layer

### Priority: highest future feature

### Problem

The current models in `replicalab/models.py` use domain-biased field names:

- `paper_title`, `paper_hypothesis`, `paper_method`, `paper_key_finding`
- `equipment_available`, `reagents_in_stock`, `staff_count`
- `sample_size`, `controls`, `technique`

These work for the three MVP scenario families (cell biology, ML benchmark, behavioral psychology) because all three map onto a lab-style replication frame. But if the environment needs to support domains outside scientific replication (e.g., engineering design, clinical trial planning, supply chain optimization), the field names stop making sense.

The turn protocol itself (`propose`, `revise`, `request_info`, `accept`) is already generic. The gap is in the observation and protocol content layer.

### Solution: normalized scenario representation

Introduce a structured internal representation that any domain adapter can emit:

```python
class NormalizedScenarioPack(BaseModel):
    domain_id: str                # "cell_biology", "ml_benchmark", etc.
    task_summary: str             # what the agent is trying to achieve
    success_criteria: list[str]   # measurable conditions for success
    constraints: list[Constraint] # budget, time, equipment, policy, etc.
    resources: list[Resource]     # what is available to work with
    allowed_substitutions: list[Substitution]  # valid swaps the agent can propose
    hidden_reference_spec: dict   # ground truth the judge scores against
    difficulty: str               # "easy", "medium", "hard"
    metadata: dict                # domain-specific extras
```

Where:

```python
class Constraint(BaseModel):
    dimension: str     # "budget", "time", "equipment", "personnel", "safety"
    label: str         # human-readable name
    value: Any         # the constraint value (numeric, list, etc.)
    hard: bool = True  # hard constraint vs soft preference

class Resource(BaseModel):
    category: str      # "equipment", "reagent", "compute", "personnel"
    name: str          # resource identifier
    available: bool    # currently available
    quantity: Optional[int] = None  # count if applicable
    notes: str = ""    # booking conflicts, expiry, etc.

class Substitution(BaseModel):
    original: str          # what the reference spec uses
    replacement: str       # what the agent can use instead
    quality_impact: float  # 0.0 to 1.0, how much fidelity is lost
    cost_delta: float      # cost difference
```
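To make the schema concrete, here is a dependency-free sketch of how a cell-biology scenario might populate the pack. Plain dataclasses stand in for the Pydantic models above, and all field values are illustrative, not canonical:

```python
from dataclasses import dataclass
from typing import Any, Optional

# Dataclass stand-ins for the Pydantic Constraint/Resource models above,
# used only to illustrate the shape of a normalized scenario pack.
@dataclass
class Constraint:
    dimension: str
    label: str
    value: Any
    hard: bool = True

@dataclass
class Resource:
    category: str
    name: str
    available: bool
    quantity: Optional[int] = None
    notes: str = ""

pack = {
    "domain_id": "cell_biology",
    "task_summary": "Replicate the dose-response trend without dropping essential controls.",
    "success_criteria": ["dose-response trend reproduced", "both controls preserved"],
    "constraints": [
        Constraint("budget", "Remaining lab budget", 850.0),
        Constraint("time", "Time limit in days", 7),
    ],
    "resources": [
        Resource("equipment", "co2_incubator", available=True),
        Resource("equipment", "microscope", available=True),
    ],
    "difficulty": "medium",
}

# Hard constraints are the ones a feasibility check must never waive.
hard_dims = [c.dimension for c in pack["constraints"] if c.hard]
print(hard_dims)  # ['budget', 'time']
```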

### Architecture principle

```
Domain template
  -> Scenario adapter (thin mapper, <50 lines per domain)
  -> NormalizedScenarioPack
  -> Observation mapper (fills ScientistObservation / LabManagerObservation)
  -> Prompt assembler (data-driven, not hard-coded)
  -> Validator (checks action against constraints)
  -> Scorer (compares final protocol against hidden_reference_spec)
```

The external contract (`ScientistAction`, `LabManagerAction`, `ScientistObservation`, `LabManagerObservation`, `StepResult`) stays unchanged. The normalization lives below those models as an internal implementation layer.

LLMs reason and negotiate. They never own truth. Truth lives in the normalized scenario pack and the deterministic scorer.

### How this affects the future core logic

| Current component | Impact | Severity |
|---|---|---|
| `replicalab/models.py` | External contract unchanged. Add `NormalizedScenarioPack` and helper models as new classes | Low |
| `replicalab/scenarios/templates.py` (SCN 02) | Must define the normalized schema. `generate_scenario()` returns a pack instead of raw dicts | High |
| `replicalab/scenarios/*.py` (SCN 03-05) | Each domain file becomes a thin scenario adapter that emits a normalized pack | Medium |
| `replicalab/scenarios/templates.py` (SCN 06) | Difficulty scaling becomes mechanical: add/remove constraints, tighten resource limits | Medium, but simpler |
| `replicalab/scenarios/templates.py` (SCN 07) | Constraint generator emits `Constraint` objects instead of ad hoc lab fields | High |
| `replicalab/scenarios/templates.py` (SCN 08) | `hidden_reference_spec` is part of the pack, not a separate hidden structure | Medium |
| `replicalab/utils/validation.py` (MOD 05-06) | Validators read `constraints[]` and `resources[]` from the pack instead of checking lab-specific fields | High |
| `replicalab/scoring/*.py` (JDG 01-04) | Scorers compare the final protocol against `hidden_reference_spec` on normalized dimensions | High |
| `replicalab/env/replicalab_env.py` (ENV 01-07) | `EpisodeState` gains a `scenario_pack` field. Reset populates it from the adapter | Medium |
| `replicalab/agents/scientist_policy.py` (AGT 01-02) | Prompts assembled from scenario pack data, not hard-coded domain text | Medium |
| `replicalab/agents/lab_manager_policy.py` (AGT 05-07) | Feasibility checker reads normalized constraints instead of lab-specific fields | Medium |
| `frontend/` (UI 01+) | Render "constraint cards" and "resource cards" instead of lab-specific panels | Low (future) |

### What stays the same

- The turn protocol (`propose`, `revise`, `request_info`, `accept`)
- The reward formula (`10 * rigor * feasibility * fidelity + bonuses - penalties`)
- The external API contract (REST + WebSocket payloads)
- The training loop and RL pipeline
- The deterministic reward principle
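For concreteness, the stated reward formula can be written as a one-line function. The component values in the example call are arbitrary illustrations, not canonical weights:

```python
def total_reward(rigor: float, feasibility: float, fidelity: float,
                 bonuses: float = 0.0, penalties: float = 0.0) -> float:
    """Reward formula from above; the multiplicative core means any
    zero component zeroes the base reward before bonuses and penalties."""
    return 10 * rigor * feasibility * fidelity + bonuses - penalties

print(round(total_reward(1.0, 0.9, 0.8, bonuses=0.2), 6))  # 7.4
```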

---

## 2. Planned work items for the normalized scenario layer

### Item 1: Define the normalized scenario schema

**What:** Add `NormalizedScenarioPack`, `Constraint`, `Resource`, and `Substitution` as Pydantic models in a new file `replicalab/scenarios/schema.py`.

**Why:** This is the foundation. Every other item depends on having a stable schema that all adapters, validators, and scorers agree on.

**Depends on:** Core MVP scenario work (SCN 02-09) being complete so we know what fields the adapters actually need.

**Scope:** ~80 lines of model definitions, no business logic.

---

### Item 2: Convert existing scenario templates into adapters

**What:** Refactor `cell_biology.py`, `ml_benchmark.py`, and `behavioral_psych.py` so each one returns a `NormalizedScenarioPack` instead of raw domain-specific dicts.

**Why:** Proves the schema works for all three MVP domains. If a field cannot be cleanly mapped, the schema needs revision before adding new domains.

**Depends on:** Item 1 (schema exists), SCN 03-05 (domain templates exist).

**Scope:** ~50 lines per adapter. Should be thin mappers. If an adapter exceeds 50 lines, the schema is wrong.

**Constraint:** The existing observation fields (`paper_title`, `equipment_available`, etc.) must still be populated. The adapter fills both the normalized pack and the legacy observation slots until the observation models are generalized.

---

### Item 3: Build data-driven prompt assembly

**What:** Replace hard-coded prompt text with a template that assembles from the scenario pack:

```
You are a {role} working on: {task_summary}

Success criteria:
{success_criteria[]}

You must work within these constraints:
{constraints[].label}: {constraints[].value}

Available resources:
{resources[].name} ({resources[].category}): {available/unavailable}
```
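A minimal assembler for this template might look like the sketch below. Plain dicts stand in for the scenario pack, the field names follow the normalized schema sketch, and nothing here is the actual AGT 01 implementation:

```python
def assemble_prompt(role: str, pack: dict) -> str:
    """Render a role prompt purely from normalized scenario pack data."""
    lines = [f"You are a {role} working on: {pack['task_summary']}", ""]
    lines.append("Success criteria:")
    lines += [f"- {c}" for c in pack["success_criteria"]]
    lines.append("")
    lines.append("You must work within these constraints:")
    lines += [f"- {c['label']}: {c['value']}" for c in pack["constraints"]]
    lines.append("")
    lines.append("Available resources:")
    lines += [
        f"- {r['name']} ({r['category']}): "
        + ("available" if r["available"] else "unavailable")
        for r in pack["resources"]
    ]
    return "\n".join(lines)

pack = {
    "task_summary": "Replicate the dose-response trend.",
    "success_criteria": ["trend reproduced"],
    "constraints": [{"label": "Budget", "value": 850.0}],
    "resources": [{"name": "microscope", "category": "equipment", "available": True}],
}
print(assemble_prompt("scientist", pack).splitlines()[0])
# You are a scientist working on: Replicate the dose-response trend.
```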
| 176 |
+
|
| 177 |
+
**Why:** Makes AGT 01 (Scientist prompt) and AGT 07 (Lab Manager templates)
|
| 178 |
+
domain-neutral. Adding a new domain requires only a new adapter, not new
|
| 179 |
+
prompts.
|
| 180 |
+
|
| 181 |
+
**Depends on:** Item 2 (adapters produce normalized packs), AGT 01 and
|
| 182 |
+
AGT 07 existing in their MVP form.
|
| 183 |
+
|
| 184 |
+
**Scope:** One prompt template function per role. ~40 lines each.
|
| 185 |
+
|
| 186 |
+
---
|
| 187 |
+
|
| 188 |
+
### Item 4: Hybrid LLM Lab Manager with deterministic post-checking
|
| 189 |
+
|
| 190 |
+
**What:** Replace the rule-based Lab Manager with a hybrid architecture:
|
| 191 |
+
|
| 192 |
+
1. LLM receives the `LabManagerObservation` and generates negotiation text
|
| 193 |
+
plus alternative suggestions in natural language
|
| 194 |
+
2. Deterministic constraint checker computes the real feasibility flags by
|
| 195 |
+
reading the normalized scenario pack's `constraints[]` and `resources[]`
|
| 196 |
+
3. A composer merges the LLM output with the checker output into a valid
|
| 197 |
+
`LabManagerAction`
|
| 198 |
+
4. The `model_validator` on `LabManagerAction` catches any inconsistency
|
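The checker-owns-the-flags rule behind step 3 can be sketched as a composer like the following. The helper name is hypothetical, and since the real `LabManagerAction` is a Pydantic model, a plain dict stands in here:

```python
def compose_action(llm_output: dict, checker_flags: dict) -> dict:
    """Merge LLM negotiation text with deterministic feasibility flags.

    The deterministic checker is the sole authority on boolean flags;
    the LLM only contributes explanation text and suggestion ideas.
    """
    return {
        "feasible": checker_flags["feasible"],  # never taken from the LLM
        "violated_constraints": checker_flags["violated_constraints"],
        "explanation": llm_output.get("explanation", ""),
        "suggestions": llm_output.get("suggestions", []),
    }

action = compose_action(
    llm_output={"explanation": "Budget is too tight; consider manual counting.",
                "suggestions": ["manual_cell_counting"]},
    checker_flags={"feasible": False, "violated_constraints": ["budget"]},
)
print(action["feasible"])  # False
```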
| 199 |
+
|
| 200 |
+
**Why:** Gives the Lab Manager realistic negotiation language and creative
|
| 201 |
+
suggestions (the LLM's strength) while keeping feasibility flags truthful
|
| 202 |
+
(the checker's strength). Training reward stays deterministic because the
|
| 203 |
+
reward engine only reads the validated action, not the LLM's raw text.
|
| 204 |
+
|
| 205 |
+
**Depends on:** Item 2 (checker needs normalized constraints), AGT 05
|
| 206 |
+
(feasibility checker exists), MOD 02 (LabManagerAction validators exist).
|
| 207 |
+
|
| 208 |
+
**Scope:** ~120 lines. The LLM call, the checker, the composer. Uses the
|
| 209 |
+
same base model as the Scientist (Qwen3-4B) with a separate role adapter.
|
| 210 |
+
|
| 211 |
+
**Risk:** Episode variance increases because the same seed may produce
|
| 212 |
+
different negotiation paths. Mitigate by keeping the deterministic checker as
|
| 213 |
+
the authority on all boolean flags. The LLM only controls `explanation` text
|
| 214 |
+
and suggestion ideas, never the truth flags.
|
| 215 |
+
|
| 216 |
+
---
|
| 217 |
+
|
| 218 |
+
### Item 5: Normalized scoring against hidden reference spec
|
| 219 |
+
|
| 220 |
+
**What:** Refactor the scoring engine so `score_rigor()`,
|
| 221 |
+
`score_feasibility()`, and `score_fidelity()` compare the final protocol
|
| 222 |
+
against `hidden_reference_spec` from the normalized scenario pack instead of
|
| 223 |
+
using domain-specific scoring logic.
|
| 224 |
+
|
| 225 |
+
Scoring dimensions become:
|
| 226 |
+
|
| 227 |
+
- **Rigor:** Does the protocol preserve the success criteria? Compare
|
| 228 |
+
`protocol.controls` against `hidden_reference_spec.required_controls`,
|
| 229 |
+
check sample size ratio, verify statistical validity markers.
|
| 230 |
+
- **Feasibility:** Does the protocol satisfy all hard constraints? Walk
|
| 231 |
+
`constraints[]` and check each one against the protocol.
|
| 232 |
+
- **Fidelity:** How close is the protocol to the reference spec? Compare
|
| 233 |
+
technique, duration, equipment, reagents against
|
| 234 |
+
`hidden_reference_spec` and compute a similarity score using
|
| 235 |
+
`allowed_substitutions[]` quality impact.
|
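As a toy illustration of the fidelity idea, the sketch below scores only the technique dimension against a substitution table; the real scorer would combine several dimensions, and the substitution values here are invented:

```python
def score_fidelity_technique(protocol_technique: str,
                             reference_technique: str,
                             substitutions: list[dict]) -> float:
    """Toy fidelity score for the technique dimension only.

    An exact match scores 1.0; an allowed substitution scores
    1.0 - quality_impact; anything else scores 0.0.
    """
    if protocol_technique == reference_technique:
        return 1.0
    for sub in substitutions:
        if (sub["original"] == reference_technique
                and sub["replacement"] == protocol_technique):
            return 1.0 - sub["quality_impact"]
    return 0.0

subs = [{"original": "flow_cytometry", "replacement": "manual_cell_counting",
         "quality_impact": 0.25}]
print(score_fidelity_technique("manual_cell_counting", "flow_cytometry", subs))  # 0.75
```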
| 236 |
+
|
| 237 |
+
**Why:** Makes scoring work for any domain without per-domain scorer code.
|
| 238 |
+
The domain-specific knowledge lives in the scenario adapter (which defines
|
| 239 |
+
what the reference spec and constraints are), not in the scoring engine.
|
| 240 |
+
|
| 241 |
+
**Depends on:** Item 1 (schema with `hidden_reference_spec`), Item 2
|
| 242 |
+
(adapters populate it), JDG 01-04 (MVP scorers exist to refactor from).
|
| 243 |
+
|
| 244 |
+
**Scope:** Refactor of existing scorer files. ~150 lines total across
|
| 245 |
+
`rigor.py`, `feasibility.py`, `fidelity.py`.
|
| 246 |
+
|
| 247 |
+
---
|
| 248 |
+
|
| 249 |
+
### Item 6: Lab Manager orchestrator with specialist subagents
|
| 250 |
+
|
| 251 |
+
**What:** Decompose the hybrid Lab Manager into a coordinator that delegates
|
| 252 |
+
to specialist subagents:
|
| 253 |
+
|
| 254 |
+
| Subagent | Responsibility |
|
| 255 |
+
|---|---|
|
| 256 |
+
| Budget agent | Checks cost against remaining budget |
|
| 257 |
+
| Scheduling agent | Checks timeline and booking conflicts |
|
| 258 |
+
| Equipment agent | Checks equipment availability and substitutions |
|
| 259 |
+
| Safety agent | Checks policy and compliance constraints |
|
| 260 |
+
| Coordinator | Aggregates subagent outputs into one `LabManagerAction` |
|
| 261 |
+
|
| 262 |
+
Externally, the contract is unchanged: one `LabManagerAction` per turn. The
|
| 263 |
+
orchestration is internal.
|
| 264 |
+
|
| 265 |
+
**Why:** Stronger multi-agent story for the hackathon track alignment.
|
| 266 |
+
Demonstrates that the Lab Manager is not a monolithic policy but a team of
|
| 267 |
+
constraint specialists. Each subagent can be individually tested, improved,
|
| 268 |
+
or replaced.
|
| 269 |
+
|
| 270 |
+
**Depends on:** Item 4 (hybrid Lab Manager works first), Item 2 (normalized
|
| 271 |
+
constraints are available for each subagent to read).
|
| 272 |
+
|
| 273 |
+
**Scope:** Orchestration layer ~200 lines. Each subagent ~40 lines. Total
|
| 274 |
+
~400 lines.
|
| 275 |
+
|
| 276 |
+
**Risk:** Adds latency (multiple LLM calls or multiple checker passes per
|
| 277 |
+
turn), orchestration failure handling, and logging complexity. Only pursue
|
| 278 |
+
after the single hybrid Lab Manager is stable and training is producing
|
| 279 |
+
results.
|
| 280 |
+
|
| 281 |
+
**Phasing:** This is the lowest priority item. Build it only if the MVP is
|
| 282 |
+
complete, training shows improvement, and there is time remaining before
|
| 283 |
+
submission.
|
| 284 |
+
|
| 285 |
+
---
|
| 286 |
+
|
| 287 |
+
## 3. Recommended order
|
| 288 |
+
|
| 289 |
+
| Order | Item | Gate |
|
| 290 |
+
|---|---|---|
|
| 291 |
+
| 1 | Define normalized scenario schema | After SCN 02-09 complete |
|
| 292 |
+
| 2 | Convert templates to adapters | After Item 1 |
|
| 293 |
+
| 3 | Data-driven prompt assembly | After Item 2 + AGT 01/07 |
|
| 294 |
+
| 4 | Hybrid LLM Lab Manager | After Item 2 + AGT 05 |
|
| 295 |
+
| 5 | Normalized scoring | After Item 2 + JDG 01-04 |
|
| 296 |
+
| 6 | Lab Manager orchestrator with subagents | After Item 4 stable |
|
| 297 |
+
|
| 298 |
+
---
|
| 299 |
+
|
| 300 |
+
## 4. Key principle
|
| 301 |
+
|
| 302 |
+
The external contract stays stable. Internal policy can evolve. LLMs reason
|
| 303 |
+
and negotiate. They never own truth. Truth lives in the normalized scenario
|
| 304 |
+
pack and the deterministic scorer.
|
docs/kian/README.md
ADDED
|
@@ -0,0 +1,10 @@
# Kian Folder

This folder holds Kian's planning docs for Person A-owned work.

Expected files:

- `task_list.md`
- `task_breakdown.md`
- `notes.md`
docs/kian/notes.md
ADDED
|
@@ -0,0 +1,6 @@
# Person A Notes

Use this file for working notes and short-term reminders.

Durable deviations belong in `docs/changes.md`.
docs/kian/task_breakdown.md
ADDED
|
@@ -0,0 +1,40 @@
# Kian (Person A) Task Breakdown

Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`

---

## Current status

- `FND 04`, `FND 08`, `FND 09`, `MOD 01` to `MOD 05`, `MOD 11`, `MOD 12` are complete
- Shared `AGT 05` is now complete, so the deterministic feasibility layer exists for both the Lab Manager path and the judge feasibility score
- `SCN 01` to `SCN 10` are complete, so the deterministic scenario layer exists in code
- `ENV 01` to `ENV 08` are all complete: the full environment lifecycle (reset, step, validate, Lab Manager response, termination, judge scoring, state snapshot, close) works end-to-end
- `JDG 01` to `JDG 06` plus `JDG 08` are complete: the deterministic reward pipeline is wired, the plain-English explanation layer exists, and the reward stack now has stronger regression coverage for ordering, substitution behavior, partial feasibility credit, and breakdown determinism
- `TST 01` to `TST 05` are complete with 36 env tests and 40 reward tests passing
- `MOD 06`, `SCN 13`, `AGT 09`, `JDG 11`, `ENV 11`, `ENV 10`, and `OBS 04` are now complete, so the remaining Kian work is the schema follow-on (`MOD 08`)

Bounded-tool scope note:

1. Kian-owned scenario, judge, and environment tasks now need to support bounded `search`, `code_check`, and `image_inspection` traces without changing the outer action contract.
2. Training reward must remain deterministic and must not depend on live web.
3. Frozen evidence packs are the default training-time source of tool inputs.
4. Audio remains out of scope.

---

## Recommended execution order

1. `MOD 08`: add schema and validator unit-test expansion

---

## Why this order

- `SCN 13` is complete, so the normalized scenario layer now carries booking and scheduling conflicts as structured deterministic data.
- `AGT 09` is complete, so the grounded Lab Manager checker, suggestion, and response stack now has deterministic regression coverage.
- `JDG 11` is complete and `ENV 11` is now integrated, so terminal env outputs and replay-facing state carry the canonical audit payload end to end.
- `ENV 10` and `OBS 04` are now complete, so the environment stack has deterministic replay and broader regression coverage on top of the completed ENV 01-08 and ENV 11 lifecycle.
- `MOD 08` is the only remaining Kian-owned implementation task, and it is now fully unblocked.
docs/kian/task_list.md
ADDED
|
@@ -0,0 +1,79 @@
# Kian (Person A) Task List

Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`

---

## Current status

- `FND 04`, `FND 08`, and `FND 09` are complete
- `MOD 01`, `MOD 02`, `MOD 03`, `MOD 04`, `MOD 05`, `MOD 11`, and `MOD 12` are complete
- Shared `AGT 05` is now complete through Ayush's implementation of the deterministic feasibility checker
- `SCN 01` to `SCN 10` are now complete in the repo
- The normalized scenario pack, seeded generation, difficulty scaling, and three initial domain families are already present
- `ENV 01` to `ENV 08` are now complete, so the full environment lifecycle (reset, step, validate, Lab Manager response, termination, judge scoring, state snapshot, close) works end-to-end
- `JDG 01` to `JDG 06` are now complete, so the deterministic reward pipeline and plain-English explanation layer are fully wired
- `TST 01` to `TST 05` are now complete, with 36 env tests and 40 reward tests passing
- `MOD 06`, `JDG 06`, `JDG 08`, `SCN 13`, `AGT 09`, and `JDG 11` are now complete
- `JDG 06` now also unblocks Ayush's `AGT 10`
- `ENV 11`, `ENV 10`, and `OBS 04` are now complete
- `MOD 08` is now fully unblocked after `MOD 07` was completed

---

## Immediate next tasks

- [ ] **MOD 08** | Add schema and validator unit-test expansion | 0.75h | Depends: all prerequisites complete

---

## Foundation and scenario tasks already landed

- [x] **FND 04** | Completed by Person B (Ayush)
- [x] **FND 08** | Completed with shared sign-off
- [x] **FND 09** | Completed by Person B (Ayush)
- [x] **MOD 01** | Completed by Person B (Ayush)
- [x] **MOD 02** | Completed by Person B (Ayush)
- [x] **MOD 03** | Completed by Person B (Ayush)
- [x] **MOD 04** | Completed by Person B (Ayush)
- [x] **MOD 05** | Completed by Person B (Ayush)
- [x] **MOD 06** | Completed by Person B (Ayush)
- [x] **MOD 11** | Completed by Person B (Ayush)
- [x] **MOD 12** | Completed by Person B (Ayush)
- [x] **AGT 05** | Completed by Person B (Ayush)
- [x] **SCN 01** | Completed by Person B (Ayush)
- [x] **SCN 02** | Completed by Person B (Ayush)
- [x] **SCN 03** | Completed by Person B (Ayush)
- [x] **SCN 04** | Completed by Person B (Ayush)
- [x] **SCN 05** | Completed by Person B (Ayush)
- [x] **SCN 06** | Completed by Person B (Ayush)
- [x] **SCN 07** | Completed by Person B (Ayush)
- [x] **SCN 08** | Completed by Person B (Ayush)
- [x] **SCN 09** | Completed by Person B (Ayush)
- [x] **SCN 10** | Completed by Person B (Ayush)
- [x] **SCN 13** | Completed by Person B (Ayush)
- [x] **ENV 01** | Completed by Person B (Ayush)
- [x] **ENV 02** | Completed by Person B (Ayush)
- [x] **ENV 03** | Completed by Person B (Ayush)
- [x] **ENV 04** | Completed by Person B (Ayush)
- [x] **ENV 05** | Completed by Person B (Ayush)
- [x] **ENV 06** | Completed by Person B (Ayush)
- [x] **ENV 07** | Completed by Person B (Ayush)
- [x] **ENV 08** | Completed by Person B (Ayush)
- [x] **ENV 10** | Completed by Person B (Ayush)
- [x] **ENV 11** | Completed by Person B (Ayush)
- [x] **OBS 04** | Completed by Person B (Ayush)
- [x] **JDG 01** | Completed by Person B (Ayush)
- [x] **JDG 02** | Completed by Person B (Ayush)
- [x] **JDG 03** | Completed by Person B (Ayush)
- [x] **JDG 04** | Completed by Person B (Ayush)
- [x] **JDG 05** | Completed by Person B (Ayush)
- [x] **JDG 06** | Completed by Person B (Ayush)
- [x] **JDG 08** | Completed by Person B (Ayush)
- [x] **JDG 11** | Completed by Person B (Ayush)
- [x] **AGT 09** | Completed by Person B (Ayush)
- [x] **TST 01** | Completed by Person B (Ayush)
- [x] **TST 02** | Completed by Person B (Ayush)
- [x] **TST 03** | Completed by Person B (Ayush)
- [x] **TST 04** | Completed by Person B (Ayush)
- [x] **TST 05** | Completed by Person B (Ayush)
docs/kush/README.md
ADDED
|
@@ -0,0 +1,10 @@
# Kush Folder

This folder holds Kush's planning docs for Person D-owned work.

Expected files:

- `task_list.md`
- `task_breakdown.md`
- `notes.md`
docs/kush/notes.md
ADDED
|
@@ -0,0 +1,92 @@
# Person D Notes

Use this file for working notes and short-term reminders.

Durable deviations belong in `docs/changes.md`.

---

## 2026-03-08 demo-flow refinement

- Dashboard now frames the product as `paper -> brief -> negotiate -> judge -> train`.
- Episode page now foregrounds the source paper and explicitly connects the terminal judge result to the training loop.
- Controls now read as replication setup instead of generic episode controls.
- Compare page is positioned as a seeded evaluation bench rather than the primary training-results story.
- The frontend default step action is now scenario-aware, so the live episode path produces valid judged runs instead of immediate invalid-action penalties on ML cases.
- The negotiation panel now shows an explicit `Advance First Round` CTA so a newly reset episode no longer looks frozen at `0 messages`.
- The dashboard `Replicate a Paper` CTA now launches a seeded live demo automatically: reset, first proposal, autoplay, and judged completion all happen without extra clicks.
- The replication setup card now performs a backend health check up front and surfaces a concrete startup command instead of the opaque browser-level `Failed to fetch` message when the API server is down.

## 2026-03-08 three-outcome live demo

- The live demo now has three seeded story modes on the dashboard: `fast-agreement`, `learning-opportunity`, and `no-agreement`.
- Each mode runs against the real backend with deterministic episode data and renders a post-episode results report instead of stopping at a generic terminal state.
- The results report now shows executed rounds, disagreement count, replicability score, paper reliability quality, reward and score charts, training interpretation, and next-tool suggestions.
- Verified backend-driven outputs for the current seeded ML demo cases:
  - `fast-agreement` -> round `2`, verdict `accept`, cumulative reward `2.906845`
  - `learning-opportunity` -> round `6`, verdict `accept`, cumulative reward `4.537097`
  - `no-agreement` -> round `6`, verdict `timeout`, cumulative reward `0.366529`

## 2026-03-08 training page with real artifacts

- Added a dedicated `/training` page instead of relying on the old packaged dashboard card.
- The new page is backed by real artifact values from the existing outputs:
  - local deterministic baseline summary
  - live ART/OpenEnv scientist checkpoints
  - seeded hold-out compare summary
  - scientist and lab-manager preview summaries
- The training story is now explicit and honest:
  - the training pipeline works
  - live reward moved positive by later checkpoints
  - hold-out compare still shows the trained Scientist underperforming baseline
  - more training and parser/invalid-action cleanup are still needed
- Header nav now includes `Training`, the dashboard training CTA points there, and the dashboard training teaser uses the same artifact-backed data.

## 2026-03-08 automated demo video build

- Added `scripts/build_demo_video.py` to synthesize an ElevenLabs voiceover from `.env`, capture clean frontend screenshots, generate captioned slides, and build the final mp4 with `ffmpeg`.
- Added `docs/demo_video_script_60s.md` as the canonical one-minute narration and shot list.
- Generated the current outputs under `replicalab/outputs/demo_video/`:
  - `audio/voiceover.mp3`
  - `replicalab_demo_60s.mp4`
  - `text/voiceover.txt`
  - `text/voiceover.srt`

## 2026-03-08 Hugging Face Space redeploy

- Investigated the public Space after it showed only the backend landing page instead of the React app.
- Confirmed the repo already had the correct multi-stage Dockerfile and SPA-serving `server/app.py`, but the runtime SHA was still pinned to an older backend-only container.
- Synced the current app files to `ayushozha/replicalab` through the Hugging Face API, restarted the Space, and waited for the runtime SHA to advance to the new repo revision.
- Reverified:
  - `https://ayushozha-replicalab.hf.space/` now serves the React frontend
  - `https://ayushozha-replicalab.hf.space/episode?...` returns `200`
  - `https://ayushozha-replicalab.hf.space/health` still reports `{"status":"ok","env":"real","version":"0.1.0"}`

## 2026-03-08 policy-results clarification page

- Added a dedicated `/policies` frontend route for the question: baseline vs trained vs oracle.
- The new page makes the current runtime explicit:
  - `/compare` is still the seeded deterministic evaluation bench
  - the public app is not currently mounting the trained Scientist adapter
  - the public app is not currently mounting the Anthropic oracle path
  - the Judge remains deterministic
- Updated `/compare` with a callout so it no longer implies that it is already comparing live mounted model policies.

## 2026-03-08 localhost model-backed Scientist mode

- Added live runtime detection to the episode flow through `/runtime`.
- Non-demo localhost episodes now prefer the backend `/agent-step` route instead of the frontend default action builder when a model runtime is available.
- The episode page now surfaces the current Scientist runtime directly in the UI, so it is clear whether localhost is using the baseline or a model-backed path.
- Current live localhost mode is `ollama` with `glm-5:cloud`.
- Anthropic-backed Scientist mode exists in code, but the current Anthropic account cannot run live due to insufficient API credits, so localhost falls back to the Ollama runtime for real model-driven stepping.

## 2026-03-08 dynamic live-run and judge-caveat cleanup

- The main dashboard CTA no longer launches the same fixed seeded flow every time.
- `Replicate a Random Paper` now generates a fresh seeded route with a random scenario family, difficulty, and seed, then autostarts the live episode path.
- The three fixed cards remain available, but are now labeled as scripted outcomes rather than the default live experience.
- Accepted verdicts that still carry weak-component reasons are now shown as `Accept with caveats` in the judge-facing UI instead of `Accept` plus a contradictory `Failure Reasons` block.
- The results page now reports those cases as conditional replication candidates rather than clean wins.
- The stage animation and completion toast now treat accepted-with-caveats runs as partial wins instead of full celebratory successes.
- Live reset verification confirmed the random path can surface distinct paper briefs across scenario families, including CIFAR-10 replication and offline mean-reversion backtest cases.
docs/kush/task_breakdown.md
ADDED
@@ -0,0 +1,40 @@
# Kush (Person D) Task Breakdown

Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`

---

## Current situation

- Person D's backlog is complete in the main tracker.
- The frontend narrative has now been tightened around the actual hackathon demo:
  1. show the paper
  2. show the parsed replication task
  3. watch the Scientist and Lab Manager negotiate
  4. reveal the deterministic judge verdict
  5. connect that verdict to training
- The current frontend build is green after cleaning pre-existing strict TypeScript issues in several imported UI components.

---

## Demo-order guidance

1. Dashboard
   Show the four-step paper-to-training flow and use `Replicate a Paper`.
2. Episode page
   Keep the audience on the source paper and benchmark context first, then let the conversation and score panels do the live work.
3. Episode end state
   Use the training callout to explain why the judge output matters beyond the demo.
4. Training panel
   Reference the minimal Colab notebook and fixed-seed evaluation framing.
5. Compare page
   Position it as a seeded evaluation bench for additional cases.

---

## Remaining practical polish

- If a final live training run is ready, replace the packaged demo comparison data in `TrainingResults.tsx`.
- Capture updated screenshots or footage from the new dashboard and episode layouts.
- Keep README/demo copy aligned with the same paper-to-training sequence.
- Keep the backend health check visible on the setup card so live demos fail loudly and instructively if the API server is not running.
docs/kush/task_list.md
ADDED
@@ -0,0 +1,35 @@
# Kush (Person D) Task List

Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`

---

## Current status

- All Person D implementation and storytelling tasks are recorded complete in the source-of-truth backlog.
- The frontend now presents the demo in the intended order:
  - source paper
  - parsed replication brief
  - live negotiation
  - deterministic judge
  - training story
- The dashboard, episode page, training panel, and evaluation bench all build successfully after the latest refinement pass.

---

## Active focus

- No open Person D implementation blockers remain in the backlog.
- Remaining polish is demo execution quality:
  - keep the live script aligned with the new paper-to-training UI flow
  - swap packaged training demo data for live artifacts if a final run is ready
  - capture final screenshots or footage from the updated frontend

---

## Notes for demo prep

- Start the live walkthrough from `/episode?template=ml_benchmark&difficulty=medium`.
- Use the left panel to anchor the narrative in the source paper and parsed brief.
- Use the right-side training callout at episode end to connect the judged reward to the minimal Colab notebook.
- Use `/compare` as the seeded evaluation bench, not as the primary baseline-vs-trained story.
docs/map/README.md
ADDED
@@ -0,0 +1,122 @@
# ReplicaLab Project Map

> Living reference of every module, class, function, and relationship.
> Updated after each implementation session.
>
> **Last updated:** 2026-03-07 (JDG 01-03 scoring implemented)

## Module Index

| File | What it covers |
|------|---------------|
| [models.md](models.md) | Data contracts — actions, observations, protocol, reward, episode state |
| [scenarios.md](scenarios.md) | Scenario generation — templates, constraints, resources, hidden specs |
| [agents.md](agents.md) | Agent policies — scientist prompt/parse/retry, lab manager feasibility/suggest/compose |
| [validation.md](validation.md) | Protocol validation — deterministic checks against scenario constraints |
| [scoring.md](scoring.md) | Judge scoring — rigor, feasibility, fidelity |
| [server.md](server.md) | FastAPI server — REST + WebSocket endpoints, stub environment |
| [frontend.md](frontend.md) | React UI — dashboard, episode viewer, components |
| [config.md](config.md) | Shared constants — rounds, budget, timeouts |
| [tests.md](tests.md) | Test coverage — 93 tests across 8 files |

## Dependency Graph

```
server/app.py
├── replicalab.config
├── replicalab.models
├── replicalab.scenarios (generate_scenario, available_scenario_families)
└── replicalab.agents (check_feasibility, suggest_alternative, compose_lab_manager_response)

replicalab/agents/scientist_policy.py
├── replicalab.models (ScientistAction, ScientistObservation, Protocol, ConversationEntry)
└── replicalab.scenarios (NormalizedScenarioPack)

replicalab/agents/lab_manager_policy.py
├── replicalab.models (LabManagerAction, LabManagerActionType, Protocol)
├── replicalab.scenarios (NormalizedScenarioPack)
└── replicalab.utils.validation (ValidationResult, validate_protocol)

replicalab/scenarios/templates.py
├── replicalab.config (MAX_BUDGET, MAX_ROUNDS)
├── replicalab.models (ScientistObservation, LabManagerObservation)
├── replicalab.scenarios.{math_reasoning, ml_benchmark, finance_trading}
└── replicalab.utils.seed (seed_rng)

replicalab/utils/validation.py
├── replicalab.models (Protocol)
└── replicalab.scenarios.templates (NormalizedScenarioPack)

replicalab/scoring/
├── replicalab.models (Protocol, RewardBreakdown)
├── replicalab.scenarios (NormalizedScenarioPack, HiddenReferenceSpec)
├── replicalab.agents.lab_manager_policy (check_feasibility, FeasibilityCheckResult)
└── replicalab.utils.text (element_tokens, normalize_label)
```

## File Tree (implemented only)

```
replicalab/
├── __init__.py (empty)
├── config.py (shared constants)
├── models.py (25 classes — all data contracts)
├── agents/
│   ├── __init__.py (re-exports from submodules)
│   ├── scientist_policy.py (AGT 01-04: prompt, formatter, parser, retry, baseline)
│   └── lab_manager_policy.py (AGT 05-07: feasibility, suggest, compose)
├── scenarios/
│   ├── __init__.py (re-exports from templates)
│   ├── templates.py (NormalizedScenarioPack, generate_scenario, apply_difficulty)
│   ├── math_reasoning.py (2 cases: Cauchy-Schwarz, Jensen's inequality)
│   ├── ml_benchmark.py (2 cases: AG News TinyBERT, CIFAR-10 ResNet-18)
│   └── finance_trading.py (2 cases: SPY/QQQ mean-reversion, momentum futures)
├── scoring/
│   ├── __init__.py (exports score_rigor, score_feasibility, score_fidelity)
│   ├── rigor.py (JDG 01: structural quality + criteria coverage)
│   ├── feasibility.py (JDG 02: wraps FeasibilityCheckResult with partial credit)
│   └── fidelity.py (JDG 03: substitution-aware hidden spec alignment)
└── utils/
    ├── seed.py (deterministic RNG from SHA256)
    ├── text.py (shared token matching: normalize_label, element_tokens)
    └── validation.py (MOD 05: protocol validation, 5 checks)

server/
└── app.py (FastAPI + WebSocket + _StubEnv)

frontend/
├── package.json (React 19, Three.js, Framer Motion, Recharts, Tailwind)
├── src/
│   ├── App.tsx (router: /, /episode, /episode/:id)
│   ├── types/index.ts (TypeScript interfaces mirroring Python models)
│   ├── lib/
│   │   ├── api.ts (REST + WebSocket client + mock data generators)
│   │   ├── audio.ts (audio utilities)
│   │   └── utils.ts (shared helpers)
│   ├── components/ (15 React components)
│   └── pages/ (DashboardPage, EpisodePage)
└── vite.config.ts

tests/
├── test_config.py (3 tests)
├── test_models.py (15 tests)
├── test_scenarios.py (8 tests)
├── test_validation.py (13 tests)
├── test_scientist_policy.py (18 tests)
├── test_lab_manager_policy.py (13 tests)
├── test_reward.py (18 tests — JDG 01-03 scoring)
└── test_server.py (5 tests — API endpoints)
```
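
The `seed.py` entry above only names its mechanism. A minimal sketch of deterministic SHA-256-based seeding (the name `seed_rng` comes from the dependency graph; the key format and byte handling here are assumptions, not the module's actual code) could look like:

```python
import hashlib
import random

def seed_rng(key: str) -> random.Random:
    """Derive a reproducible RNG from a string key via SHA-256."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer seed.
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed)

# Same key -> same stream, regardless of process or platform.
assert seed_rng("ml_benchmark:easy:42").random() == seed_rng("ml_benchmark:easy:42").random()
```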

## Task Completion Status

| Area | Done | Remaining | Key gaps |
|------|------|-----------|----------|
| Models (MOD) | MOD 01-05, 09, 11-12 | MOD 06 | Semantic validators for impossible plans |
| Scenarios (SCN) | SCN 01-12 | SCN 13 | Booking/scheduling data model |
| Agents (AGT) | AGT 01-07, 11 | AGT 08-10 | LLM-backed scientist, model selection |
| Judge (JDG) | JDG 01-03 | JDG 04-08 | Reward composition, bonuses, penalties |
| Environment (ENV) | — | ENV 01-11 | Entire real environment |
| Server (API) | API 01-04, 06 (partial) | API 05, 07-10 | Replay, auth, rate limiting |
| Frontend (FND) | FND 01-10 | — | Complete |
| Training (TRN) | — | TRN 01-18 | Entire RL pipeline |
docs/map/agents.md
ADDED
@@ -0,0 +1,287 @@
# Agents Map — `replicalab/agents/`

> Deterministic policy helpers for Scientist and Lab Manager agents.
> No LLM calls in this module — the LLM backend is injected via `GenerateFn`.
>
> **Tasks implemented:** AGT 01-07, 11

## Exports — `__init__.py`

```python
# From lab_manager_policy
AlternativeSuggestion, FeasibilityCheckResult, SuggestionChange
check_feasibility, compose_lab_manager_response, suggest_alternative

# From scientist_policy
RetryMetadata, ScientistCallResult, ScientistOutputParseError
build_baseline_scientist_action, build_scientist_system_prompt
call_scientist_with_retry, format_scientist_observation, parse_scientist_output
```

---

## Scientist Policy — `scientist_policy.py`

### Pipeline Flow

```
scenario → build_scientist_system_prompt() → system_prompt
        ↓
observation → format_scientist_observation() → user_message
        ↓
call_scientist_with_retry(generate_fn, system_prompt, obs)
        ↓ calls generate_fn(messages)
        ↓ calls parse_scientist_output(raw_text)
        ↓ on failure: _build_correction_prompt(error)
        ↓ retries up to max_retries times
        → ScientistCallResult(action, metadata)
```

### Public Functions

#### `build_scientist_system_prompt(scenario) -> str` — AGT 01
Builds a domain-neutral system prompt from a `NormalizedScenarioPack`.

**Sections rendered (in order):**
1. Role statement ("You are the Scientist agent in ReplicaLab")
2. Job description (negotiate strongest feasible plan)
3. Domain ID
4. Task summary
5. Success criteria (bulleted)
6. Constraints (with hard/soft labels, quantities, comparators)
7. Available resources (with availability status)
8. Allowed substitutions (original → alternative with conditions)
9. Output contract (exactly one JSON, no extra keys)
10. Allowed action_type values
11. Action-specific field requirements

#### `format_scientist_observation(obs: ScientistObservation) -> str` — AGT 02
Converts a per-turn observation into the user message string.

**Sections (fixed order, tested):**
1. Round status: `"Round {n} of {max}"`
2. Paper summary: title, hypothesis, method, key finding, goal
3. Conversation history or "No conversation history yet"
4. Current protocol or "No protocol has been proposed yet"
5. ScientistAction schema reminder (field list, action_type values)
6. Closing instruction: "Respond with exactly one JSON object"

#### `parse_scientist_output(raw_text: str) -> ScientistAction` — MOD 09
Strict parser from raw model text into validated `ScientistAction`.

**Accepts:**
- Plain JSON objects
- `` ```json ``-fenced blocks
- Prose containing one JSON object

**Error codes:**
| Code | Meaning |
|------|---------|
| `no_json` | No JSON object found in output |
| `invalid_json` | JSON syntax error (trailing comma, etc.) |
| `invalid_action` | Valid JSON but fails ScientistAction validation |
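
The three accepted input shapes and the error codes map onto a small extraction pipeline. A simplified standalone sketch (the real parser validates the payload into a pydantic `ScientistAction` and raises `ScientistOutputParseError`; this version returns a plain dict and folds the codes into `ValueError` messages):

```python
import json
import re

# Extraction order for the three documented shapes: fenced block first,
# then a bare or prose-embedded JSON object.
_FENCED = re.compile(r"```(?:json)?\s*(\{.*?\})\s*```", re.DOTALL)
_EMBEDDED = re.compile(r"\{.*\}", re.DOTALL)

def extract_action_payload(raw_text: str) -> dict:
    match = _FENCED.search(raw_text) or _EMBEDDED.search(raw_text)
    if match is None:
        raise ValueError("no_json")                      # no JSON object found
    candidate = match.group(1) if match.re is _FENCED else match.group(0)
    try:
        payload = json.loads(candidate)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid_json: {exc}") from exc
    if not isinstance(payload, dict) or "action_type" not in payload:
        raise ValueError("invalid_action")               # schema validation would go here
    return payload

payload = extract_action_payload('Plan:\n```json\n{"action_type": "accept"}\n```')
assert payload == {"action_type": "accept"}
```

The fenced pattern is tried first so a model that wraps JSON in prose plus a code fence still parses cleanly.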

#### `call_scientist_with_retry(generate_fn, system_prompt, observation, max_retries=2) -> ScientistCallResult` — AGT 03
Retry loop with error-specific correction prompts.

**Behavior:**
1. Builds messages: `[system, user]`
2. Calls `generate_fn(messages)` → raw text
3. Calls `parse_scientist_output(raw_text)`
4. On success: returns `ScientistCallResult(action, metadata)`
5. On failure: appends `[assistant(bad_output), user(correction)]` to messages, retries
6. After `max_retries` failures: raises last `ScientistOutputParseError`

**Correction prompts (`_build_correction_prompt`):**
- `no_json`: "Your previous response did not contain a JSON object..."
- `invalid_json`: "Your previous response contained malformed JSON: {error}..."
- `invalid_action`: "...failed ScientistAction validation: {detail}. Fix the validation error..."
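
The behavior steps above can be sketched as a compact loop (simplified: the stand-in `parse` raises `ValueError` instead of `ScientistOutputParseError`, and the result is a `(payload, attempt_count)` tuple rather than a `ScientistCallResult`):

```python
import json
from typing import Callable

GenerateFn = Callable[[list[dict[str, str]]], str]

def call_with_retry(generate_fn: GenerateFn, system_prompt: str, user_msg: str,
                    parse: Callable[[str], dict], max_retries: int = 2) -> tuple[dict, int]:
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": user_msg}]
    last_error: Exception = ValueError("no attempts made")
    for attempt in range(1, max_retries + 2):  # first call plus max_retries retries
        raw = generate_fn(messages)
        try:
            return parse(raw), attempt
        except ValueError as exc:              # stand-in for ScientistOutputParseError
            last_error = exc
            # Feed the bad output plus an error-specific correction back to the model.
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user",
                             "content": f"Invalid output ({exc}). Respond with exactly one JSON object."})
    raise last_error

# Stub model: one bad reply, then valid JSON on the corrected retry.
replies = iter(["not json", '{"action_type": "accept"}'])
action, attempts = call_with_retry(lambda _msgs: next(replies), "sys", "obs", parse=json.loads)
# attempts == 2: the first failure triggered one correction round.
```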

#### `build_baseline_scientist_action(observation) -> ScientistAction` — AGT 04
Deterministic non-LLM action for smoke tests. No API calls.

**Decision tree:**
1. If protocol exists AND at max rounds → `accept`
2. If protocol exists AND latest lab_manager feedback indicates blocker → `revise_protocol` (halve sample, reduce duration)
3. If protocol exists AND no blocker → `accept`
4. If no protocol → `propose_protocol` (domain-inferred defaults)

**Domain inference (`_infer_domain`):**
- Checks paper fields for ML hints (benchmark, dataset, gpu, bert...) → `machine_learning`
- Checks for finance hints (backtest, sharpe, trading...) → `finance_trading`
- Default → `mathematics`

**Blocker detection (`_feedback_indicates_blocker`):**
- Returns `False` if action_type is `accept` or `report_feasibility`
- Otherwise checks message for blocker hints: booked, unavailable, exceeds, tight, budget, cost, etc.
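
A minimal sketch of the hint matching (the tuples are excerpts of the constants listed later in this file; the function bodies are assumptions consistent with the bullets above, not the module's exact code):

```python
# Excerpts of the real tuples in scientist_policy.py, which are longer.
_ML_HINTS = ("benchmark", "dataset", "accuracy", "train", "gpu", "bert")
_FINANCE_HINTS = ("backtest", "drawdown", "sharpe", "trading", "slippage")
_BLOCKER_HINTS = ("booked", "unavailable", "exceeds", "tight", "budget", "cost")

def infer_domain(paper_text: str) -> str:
    """Keyword-based domain guess: ML hints win, then finance, else math."""
    text = paper_text.lower()
    if any(hint in text for hint in _ML_HINTS):
        return "machine_learning"
    if any(hint in text for hint in _FINANCE_HINTS):
        return "finance_trading"
    return "mathematics"

def feedback_indicates_blocker(action_type: str, message: str) -> bool:
    """Accept/report actions are never blockers; otherwise scan for hint words."""
    if action_type in ("accept", "report_feasibility"):
        return False
    return any(hint in message.lower() for hint in _BLOCKER_HINTS)

assert infer_domain("We fine-tune TinyBERT on the AG News dataset") == "machine_learning"
assert infer_domain("A mean-reversion backtest on SPY/QQQ") == "finance_trading"
assert feedback_indicates_blocker("reject", "The GPU node is fully booked")
```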

### Classes

#### `ScientistOutputParseError(ValueError)`
| Attribute | Type | Purpose |
|-----------|------|---------|
| `code` | `Literal["no_json", "invalid_json", "invalid_action"]` | Machine-readable error type |
| `message` | `str` | Human-readable detail |
| `raw_text` | `str` | Original model output |
| `parsed_payload` | `dict \| None` | Decoded JSON if parsing succeeded |

#### `RetryMetadata(BaseModel)` — `extra="forbid"`
| Field | Type | Purpose |
|-------|------|---------|
| `attempt_count` | `int` | Total attempts (1 = success on first try) |
| `retry_count` | `int` | `attempt_count - 1` |
| `last_error_code` | `str \| None` | Error code from last failure |
| `last_error_message` | `str \| None` | Error message from last failure |

#### `ScientistCallResult(BaseModel)` — `extra="forbid"`
| Field | Type |
|-------|------|
| `action` | `ScientistAction` |
| `metadata` | `RetryMetadata` |

### Type Aliases

```python
GenerateFn = Callable[[list[dict[str, str]]], str]
```

### Constants

```python
_ML_HINTS = ("benchmark", "dataset", "accuracy", "tokenizer", "train", "gpu", ...)
_FINANCE_HINTS = ("backtest", "drawdown", "sharpe", "trading", "slippage", ...)
_BLOCKER_HINTS = ("booked", "unavailable", "exceeds", "tight", "budget", "cost", ...)
```

---

## Lab Manager Policy — `lab_manager_policy.py`

### Pipeline Flow

```
protocol + scenario → check_feasibility()
        ↓
FeasibilityCheckResult (7 dimensions)
        ↓
suggest_alternative(protocol, check, scenario)
        ↓
AlternativeSuggestion | None
        ↓
compose_lab_manager_response(check, suggestion)
        ↓
LabManagerAction (typed, with explanation)
```

### Public Functions

#### `check_feasibility(protocol, scenario) -> FeasibilityCheckResult` — AGT 05
Runs 7 deterministic dimension checks. No LLM calls.

**Checks performed:**
| Dimension | Function | What it checks |
|-----------|----------|---------------|
| `protocol` | `_build_protocol_check` | Wraps `validate_protocol()` from MOD 05 |
| `budget` | `_check_budget` | `_estimate_protocol_cost()` vs `budget_remaining` |
| `equipment` | `_check_equipment` | Items available/booked, finds substitutions |
| `reagents` | `_check_reagents` | Items in-stock/out-of-stock, finds substitutions |
| `schedule` | `_check_schedule` | `duration_days` vs `time_limit_days` |
| `staff` | `_check_staff` | `_estimate_staff_load()` vs `staff_count` |
| `policy` | `_check_policy` | Safety restrictions (e.g., offline-only execution) |

**Cost estimation (`_estimate_protocol_cost`):**
```
base = sample_size * 10
     + duration_days * 50
     + len(controls) * 25
     + len(required_equipment) * 100
     + len(required_reagents) * 75
```

**Staff estimation (`_estimate_staff_load`):**
```
base = 1
     + (1 if sample_size > 20)
     + (1 if len(controls) > 2)
     + (1 if duration_days > 5)
     + (1 if len(required_equipment) > 2)
```
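
The two formulas above are directly executable. A runnable sketch with the protocol reduced to a plain dict (the real functions take a `Protocol` model):

```python
def estimate_protocol_cost(protocol: dict) -> float:
    """Cost model from the formula above: linear in size, time, and materials."""
    return (protocol["sample_size"] * 10
            + protocol["duration_days"] * 50
            + len(protocol["controls"]) * 25
            + len(protocol["required_equipment"]) * 100
            + len(protocol["required_reagents"]) * 75)

def estimate_staff_load(protocol: dict) -> int:
    """One base staffer plus one per threshold the protocol crosses."""
    return (1
            + (1 if protocol["sample_size"] > 20 else 0)
            + (1 if len(protocol["controls"]) > 2 else 0)
            + (1 if protocol["duration_days"] > 5 else 0)
            + (1 if len(protocol["required_equipment"]) > 2 else 0))

demo = {"sample_size": 40, "duration_days": 7, "controls": ["baseline"],
        "required_equipment": ["gpu_node"], "required_reagents": []}
print(estimate_protocol_cost(demo))  # 400 + 350 + 25 + 100 + 0 = 875
print(estimate_staff_load(demo))     # 1 + 1 (sample) + 1 (duration) = 3
```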

#### `suggest_alternative(protocol, check_result, scenario) -> AlternativeSuggestion | None` — AGT 06
Deterministic revision engine. Returns `None` if already feasible.

**Fix order (deterministic):**
1. Equipment substitutions — replace booked items with alternatives
2. Reagent substitutions — replace out-of-stock items with alternatives
3. Duration clamp — reduce to `time_limit_days` if over
4. Sample size reduction — iterative halving until budget fits (max 10 iterations)

**Post-fix recheck:** runs `check_feasibility()` on revised protocol.
**Returns:** revised protocol, list of changes, remaining failures, pre/post checks.
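
Fix step 4 can be sketched as follows (a hypothetical standalone helper; the real engine works on `Protocol` models and records `SuggestionChange` entries rather than strings):

```python
def halve_sample_until_affordable(protocol: dict, budget_remaining: float,
                                  cost_fn, max_iterations: int = 10) -> tuple[dict, list[str]]:
    """Halve sample_size until cost_fn fits the budget, capped at 10 iterations."""
    revised = dict(protocol)
    changes: list[str] = []
    for _ in range(max_iterations):
        if cost_fn(revised) <= budget_remaining or revised["sample_size"] <= 1:
            break
        before = revised["sample_size"]
        revised["sample_size"] = max(1, before // 2)
        changes.append(f"sample_size: {before} -> {revised['sample_size']}")
    return revised, changes

# Toy cost model in the shape of _estimate_protocol_cost (most terms omitted).
toy_cost = lambda p: p["sample_size"] * 10 + p["duration_days"] * 50
revised, changes = halve_sample_until_affordable(
    {"sample_size": 100, "duration_days": 4}, budget_remaining=700.0, cost_fn=toy_cost)
# One halving (100 -> 50) brings the toy cost from 1200 down to 700.
```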

#### `compose_lab_manager_response(check_result, suggestion=None, explanation_renderer=None) -> LabManagerAction` — AGT 07
Converts grounded results into a typed `LabManagerAction`.

**Action type selection (`_select_lab_manager_action_type`):**
| Condition | Action |
|-----------|--------|
| All 7 dimensions pass | `ACCEPT` |
| Suggestion exists AND improved AND only non-lab failures remain | `SUGGEST_ALTERNATIVE` |
| Lab constraints fail AND no suggestion | `REJECT` |
| Only policy/protocol fail (not lab constraints) | `REPORT_FEASIBILITY` |
| Suggestion exists but didn't improve | `REJECT` |

**Lab constraints = budget, equipment, reagents, schedule, staff (not protocol, not policy).**
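
The selection table can be sketched as a plain function (simplified: `checks` is a dict of pass/fail booleans standing in for `FeasibilityCheckResult`, and the result is a string rather than a `LabManagerActionType` enum member):

```python
LAB_DIMENSIONS = ("budget", "equipment", "reagents", "schedule", "staff")

def select_action_type(checks: dict[str, bool], has_suggestion: bool,
                       suggestion_improved: bool = False,
                       only_non_lab_failures_remain: bool = False) -> str:
    if all(checks.values()):
        return "ACCEPT"
    if has_suggestion and suggestion_improved and only_non_lab_failures_remain:
        return "SUGGEST_ALTERNATIVE"
    lab_failed = any(not checks[d] for d in LAB_DIMENSIONS)
    if lab_failed and not has_suggestion:
        return "REJECT"
    if not lab_failed:                     # only protocol/policy failed
        return "REPORT_FEASIBILITY"
    return "REJECT"                        # suggestion existed but didn't improve

all_ok = {d: True for d in LAB_DIMENSIONS + ("protocol", "policy")}
assert select_action_type(all_ok, has_suggestion=False) == "ACCEPT"
assert select_action_type({**all_ok, "budget": False}, has_suggestion=False) == "REJECT"
assert select_action_type({**all_ok, "policy": False}, has_suggestion=False) == "REPORT_FEASIBILITY"
```

Note the row order matters: a feasible protocol short-circuits before any suggestion logic runs.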

### Classes

#### `DimensionCheck(BaseModel)` — `extra="forbid"`
| Field | Type | Default |
|-------|------|---------|
| `ok` | `bool` | `True` |
| `reasons` | `list[str]` | `[]` |

#### `FeasibilityCheckResult(BaseModel)` — `extra="forbid"`
| Field | Type |
|-------|------|
| `protocol` | `DimensionCheck` |
| `budget` | `DimensionCheck` |
| `equipment` | `DimensionCheck` |
| `reagents` | `DimensionCheck` |
| `schedule` | `DimensionCheck` |
| `staff` | `DimensionCheck` |
| `policy` | `DimensionCheck` |
| `estimated_cost` | `float` |
| `required_staff` | `int` |
| `substitution_options` | `dict[str, list[str]]` |
| `validation_result` | `ValidationResult` |

**Computed properties:** `protocol_ok`, `budget_ok`, `equipment_ok`, `reagents_ok`, `schedule_ok`, `staff_ok`, `feasible`, `summary`

#### `SuggestionChange(BaseModel)` — `extra="forbid"`
| Field | Type | Purpose |
|-------|------|---------|
| `field` | `str` | Which protocol field was changed |
| `original` | `str` | Original value (stringified) |
| `revised` | `str` | New value (stringified) |
| `reason` | `str` | Why it was changed |
| `tradeoff` | `str` | What is lost |

#### `AlternativeSuggestion(BaseModel)` — `extra="forbid"`
| Field | Type |
|-------|------|
| `revised_protocol` | `Protocol` |
| `applied_changes` | `list[SuggestionChange]` |
| `remaining_failures` | `list[str]` |
| `improved` | `bool` |
| `pre_check` | `FeasibilityCheckResult` |
| `post_check` | `FeasibilityCheckResult` |

### Type Aliases

```python
ExplanationRenderer = Callable[
    [LabManagerActionType, FeasibilityCheckResult, Optional[AlternativeSuggestion]],
    str,
]
```
docs/map/config.md
ADDED
@@ -0,0 +1,61 @@

# Config Map — `replicalab/config.py`

> Shared constants used across the entire project.

## Constants

| Constant | Value | Used by |
|----------|-------|---------|
| `DEFAULT_SCENARIO_TEMPLATE` | `"math_reasoning"` | server (reset defaults) |
| `DEFAULT_DIFFICULTY` | `"easy"` | server (reset defaults) |
| `MAX_ROUNDS` | `6` | scenarios (observation.max_rounds), server |
| `MAX_BUDGET` | `5000.0` | scenarios (budget_total base) |
| `TIMEOUT_SECONDS` | `300` | server (session TTL base) |
| `ROUND_TIME_LIMIT_SECONDS` | `300` | server (per-round timeout) |
| `SESSION_TTL_SECONDS` | `300` (= TIMEOUT_SECONDS) | server (session cleanup) |
| `WS_IDLE_TIMEOUT_SECONDS` | `300` (= TIMEOUT_SECONDS) | server (WebSocket idle) |
| `STUB_ACCEPT_REWARD` | `5.0` | server (_StubEnv reward on accept) |
| `API_HOST` | `"0.0.0.0"` | server (uvicorn bind) |
| `API_PORT` | `7860` | server (uvicorn port) |

## Who Imports This

| Consumer | Constants used |
|----------|---------------|
| `scenarios/templates.py` | `MAX_BUDGET`, `MAX_ROUNDS` |
| `server/app.py` | `API_HOST`, `API_PORT`, `DEFAULT_SCENARIO_TEMPLATE`, `DEFAULT_DIFFICULTY`, `MAX_ROUNDS`, `ROUND_TIME_LIMIT_SECONDS`, `SESSION_TTL_SECONDS`, `STUB_ACCEPT_REWARD`, `WS_IDLE_TIMEOUT_SECONDS` |
| `tests/test_config.py` | All constants (validation tests) |

## Project Config — `pyproject.toml`

| Key | Value |
|-----|-------|
| Name | `replicalab` |
| Version | `0.1.0` |
| Python | `>=3.10` |
| License | MIT |

### Dependencies
| Package | Version | Purpose |
|---------|---------|---------|
| `pydantic` | `>=2.7,<3.0` | Data validation |
| `fastapi` | `>=0.115,<1.0` | REST API framework |
| `uvicorn[standard]` | `>=0.34,<1.0` | ASGI server |
| `websockets` | `>=15.0,<17.0` | WebSocket support |
| `openenv-core[core]` | `>=0.2.1,<0.3.0` | Environment base (not yet used) |

### Dev Dependencies
| Package | Purpose |
|---------|---------|
| `pytest` | Testing |
| `pytest-cov` | Coverage |
| `pytest-asyncio` | Async test support |
| `httpx` | HTTP client for API tests |
| `ruff` | Linting |
| `mypy` | Type checking |

### Entry Point
```
[project.scripts]
server = "server.app:main"
```
docs/map/frontend.md
ADDED
@@ -0,0 +1,141 @@

# Frontend Map — `frontend/`

> React 19 + TypeScript + Vite UI for ReplicaLab.
>
> **Tasks implemented:** FND 01-10

## Stack

| Technology | Version | Purpose |
|------------|---------|---------|
| React | 19.2.0 | UI framework |
| React Router | 7.13.1 | Client-side routing |
| Three.js | 0.183.2 | 3D molecule scene |
| @react-three/fiber | 9.5.0 | React Three.js bindings |
| @react-three/drei | 10.7.7 | Three.js helpers |
| Framer Motion | 12.35.1 | Animations |
| @xyflow/react | 12.10.1 | Flow diagrams |
| Recharts | 3.8.0 | Charts and graphs |
| Tailwind CSS | 4.2.1 | Utility-first styling |
| Lucide React | 0.577.0 | Icons |

## Routes — `App.tsx`

| Path | Component | Purpose |
|------|-----------|---------|
| `/` | `DashboardPage` | Training overview, scenario selection |
| `/episode` | `EpisodePage` | Live episode viewer (new episode) |
| `/episode/:episodeId` | `EpisodePage` | Replay of completed episode |

## Pages

### `DashboardPage.tsx`
- Scenario selection (family + difficulty)
- Training metrics display
- Episode history list
- Start new episode button

### `EpisodePage.tsx`
- Live negotiation between Scientist and Lab Manager
- Protocol display and evolution
- Score breakdown when episode completes
- Replay controls for completed episodes

## Components (15 files)

### Negotiation & Protocol
| Component | Purpose |
|-----------|---------|
| `NegotiationLog.tsx` | Scrollable conversation between agents |
| `ProtocolPanel.tsx` | Current protocol details display |
| `PaperPanel.tsx` | Paper summary (title, hypothesis, method, finding) |
| `LabInventory.tsx` | Equipment and reagent availability |
| `Controls.tsx` | User controls (start, step, auto-play) |

### Visualization
| Component | Purpose |
|-----------|---------|
| `ScorePanel.tsx` | Rigor/feasibility/fidelity score bars |
| `JudgeAuditPanel.tsx` | Judge reasoning and audit trail |
| `TrainingResults.tsx` | Training metrics charts |
| `ReplayViewer.tsx` | Step-through replay of completed episodes |

### 3D & Animation
| Component | Purpose |
|-----------|---------|
| `CharacterStage.tsx` | 3D stage for agent characters |
| `CharacterAvatar.tsx` | Individual agent avatar |
| `AnimatedCharacter.tsx` | Character with animations |
| `MoleculeScene.tsx` | 3D molecule visualization |
| `TiltCard.tsx` | Tilt-on-hover card component |

### Layout
| Component | Purpose |
|-----------|---------|
| `Header.tsx` | Top navigation bar |

## API Client — `lib/api.ts`

### REST Functions
| Function | Method | Endpoint |
|----------|--------|----------|
| `healthCheck()` | GET | `/health` |
| `getScenarios()` | GET | `/scenarios` |
| `resetEpisode(params)` | POST | `/reset` |
| `stepEpisode(action)` | POST | `/step` |
| `getReplay(episodeId)` | GET | `/replay/{episodeId}` |

### WebSocket
| Function | Purpose |
|----------|---------|
| `createWebSocket(onMessage, onOpen, onClose, onError)` | Connect to `/ws` |
| `sendWsMessage(ws, msg)` | Send typed message |

### Mock Data (for offline development)
| Function | Returns |
|----------|---------|
| `createMockConversation()` | `NegotiationMessage[]` |
| `createMockScores()` | `ScoreBreakdown` |
| `createMockEpisodeState(done)` | `EpisodeState` |
| `createMockProtocol()` | `Protocol` |
| `createMockJudgeAudit()` | `JudgeAudit` |

## TypeScript Types — `types/index.ts`

Mirrors Python models:

| TS Interface | Python Model |
|-------------|--------------|
| `ScientistAction` | `ScientistAction` |
| `LabManagerAction` | `LabManagerAction` |
| `Protocol` | `Protocol` |
| `EpisodeState` | `EpisodeState` |
| `StepResult` | `StepResult` |
| `ScoreBreakdown` | `RewardBreakdown` |
| `FeasibilityReport` | `FeasibilityCheckResult` (partial) |
| `JudgeAudit` | `StepInfo.judge_notes` + `verdict` |
| `NegotiationMessage` | `ConversationEntry` |

Additional frontend-only types:
- `TrainingMetrics` — loss, reward curves
- `TrainingComparison` — baseline vs trained model
- `PaperSummary` — paper details for display
- `LabConstraints` — lab resource summary
- `SuggestedChange` — protocol revision display

## Utility Files

### `lib/utils.ts`
Shared helpers (class merging, formatting, etc.)

### `lib/audio.ts`
Audio feedback utilities for UI interactions.

## Assets

```
frontend/public/characters/
  judge.png (~1.2 MB)
  lab-manager.png (~900 KB)
  scientist.png (~900 KB)
```
docs/map/models.md
ADDED
@@ -0,0 +1,219 @@

# Models Map — `replicalab/models.py`

> All Pydantic data contracts. Frozen with `extra="forbid"` unless noted.
>
> **Tasks implemented:** MOD 01, 02, 03, 04, 09, 11, 12

## Enums

### `ScientistActionType(str, Enum)`
| Value | Meaning |
|-------|---------|
| `propose_protocol` | First protocol submission |
| `revise_protocol` | Modify existing protocol |
| `request_info` | Ask lab manager a question |
| `accept` | Agree to current protocol |

### `LabManagerActionType(str, Enum)`
| Value | Meaning |
|-------|---------|
| `report_feasibility` | Report on feasibility without suggestions |
| `suggest_alternative` | Propose revised protocol |
| `reject` | Reject protocol outright |
| `accept` | Approve protocol |

## Action Models

### `ScientistAction(BaseModel)` — `extra="forbid"`
MOD 01 + MOD 09. Strict contract for scientist output.

| Field | Type | Constraint | Notes |
|-------|------|-----------|-------|
| `action_type` | `ScientistActionType` | required | |
| `sample_size` | `int` | `ge=0` | Must be `>=1` for propose/revise |
| `controls` | `list[str]` | normalized | |
| `technique` | `str` | stripped | Required for propose/revise |
| `duration_days` | `int` | `ge=0` | |
| `required_equipment` | `list[str]` | normalized | |
| `required_reagents` | `list[str]` | normalized | |
| `questions` | `list[str]` | normalized | Required non-empty for request_info |
| `rationale` | `str` | stripped | Required for propose/revise |

**Validation rules:**
- `propose_protocol` / `revise_protocol`: sample_size >= 1, technique required, rationale required, questions must be empty
- `request_info`: questions non-empty, no protocol payload fields
- `accept`: no questions, no protocol payload fields

### `LabManagerAction(BaseModel)` — `extra="forbid"`
MOD 02. Strict contract with feasible-flag consistency.

| Field | Type | Constraint | Notes |
|-------|------|-----------|-------|
| `action_type` | `LabManagerActionType` | required | |
| `feasible` | `bool` | required | Must equal AND of all constraint flags |
| `budget_ok` | `bool` | required | |
| `equipment_ok` | `bool` | required | |
| `reagents_ok` | `bool` | required | |
| `schedule_ok` | `bool` | required | |
| `staff_ok` | `bool` | required | |
| `suggested_technique` | `str` | stripped | Only for suggest_alternative |
| `suggested_sample_size` | `int` | `ge=0` | Only for suggest_alternative |
| `suggested_controls` | `list[str]` | normalized | Only for suggest_alternative |
| `explanation` | `str` | required non-empty | |

**Validation rules:**
- `feasible` must equal `all(budget_ok, equipment_ok, reagents_ok, schedule_ok, staff_ok)`
- `accept` requires `feasible=True`
- `reject` requires `feasible=False`
- `suggest_alternative` requires `feasible=False` + at least one suggestion field
- Suggestion fields forbidden for non-suggest_alternative actions
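The feasible-flag consistency rule is naturally expressed as a Pydantic `model_validator`. The following is a reduced, hypothetical sketch covering only the boolean flags; the full `LabManagerAction` model in `replicalab/models.py` also enforces the action-type rules.

```python
from pydantic import BaseModel, ConfigDict, model_validator


class FeasibilityFlags(BaseModel):
    """Sketch: only the consistency check between feasible and its parts."""

    model_config = ConfigDict(extra="forbid")

    feasible: bool
    budget_ok: bool
    equipment_ok: bool
    reagents_ok: bool
    schedule_ok: bool
    staff_ok: bool

    @model_validator(mode="after")
    def check_feasible_flag(self):
        # feasible must be the AND of all five constraint flags.
        expected = all([self.budget_ok, self.equipment_ok, self.reagents_ok,
                        self.schedule_ok, self.staff_ok])
        if self.feasible != expected:
            raise ValueError("feasible must equal the AND of all constraint flags")
        return self
```

With this shape, an inconsistent payload (e.g. `feasible=True` while `budget_ok=False`) fails validation at construction time rather than deep inside the environment loop.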
## Observation Models

### `ConversationEntry(BaseModel)` — `extra="forbid"`
| Field | Type | Notes |
|-------|------|-------|
| `role` | `Literal["scientist", "lab_manager", "system"]` | |
| `message` | `str` | Required non-empty |
| `round_number` | `int` | `ge=0` |
| `action_type` | `Optional[str]` | Null or non-empty |

### `Protocol(BaseModel)` — `extra="forbid"`
Shared protocol payload used in observations and actions.

| Field | Type | Notes |
|-------|------|-------|
| `sample_size` | `int` | `ge=0` |
| `controls` | `list[str]` | normalized |
| `technique` | `str` | required non-empty |
| `duration_days` | `int` | `ge=0` |
| `required_equipment` | `list[str]` | normalized |
| `required_reagents` | `list[str]` | normalized |
| `rationale` | `str` | required non-empty |

### `ScientistObservation(BaseModel)` — `extra="forbid"`
| Field | Type |
|-------|------|
| `paper_title` | `str` |
| `paper_hypothesis` | `str` |
| `paper_method` | `str` |
| `paper_key_finding` | `str` |
| `experiment_goal` | `str` |
| `conversation_history` | `list[ConversationEntry]` |
| `current_protocol` | `Optional[Protocol]` |
| `round_number` | `int` (ge=0) |
| `max_rounds` | `int` (ge=0) |

### `LabManagerObservation(BaseModel)` — `extra="forbid"`
| Field | Type |
|-------|------|
| `budget_total` | `float` (ge=0) |
| `budget_remaining` | `float` (ge=0) |
| `equipment_available` | `list[str]` |
| `equipment_booked` | `list[str]` |
| `reagents_in_stock` | `list[str]` |
| `reagents_out_of_stock` | `list[str]` |
| `staff_count` | `int` (ge=0) |
| `time_limit_days` | `int` (ge=0) |
| `safety_restrictions` | `list[str]` |
| `conversation_history` | `list[ConversationEntry]` |
| `current_protocol` | `Optional[Protocol]` |
| `round_number` | `int` (ge=0) |
| `max_rounds` | `int` (ge=0) |

### `Observation(BaseModel)` — `extra="forbid"`
Combined wrapper. Each role receives its own view.

| Field | Type |
|-------|------|
| `scientist` | `Optional[ScientistObservation]` |
| `lab_manager` | `Optional[LabManagerObservation]` |

## Reward & Step Models

### `RewardBreakdown(BaseModel)` — default `extra="forbid"`
MOD 11. Component scores from judge rubric engine.

| Field | Type | Default | Range |
|-------|------|---------|-------|
| `rigor` | `float` | 0.0 | [0, 1] |
| `feasibility` | `float` | 0.0 | [0, 1] |
| `fidelity` | `float` | 0.0 | [0, 1] |
| `efficiency_bonus` | `float` | 0.0 | unbounded |
| `communication_bonus` | `float` | 0.0 | unbounded |
| `penalties` | `dict[str, float]` | {} | unbounded |

### `StepInfo(BaseModel)` — `extra="allow"`
MOD 11. Extensible metadata returned with each step.

| Field | Type | Default |
|-------|------|---------|
| `agreement_reached` | `bool` | False |
| `error` | `Optional[str]` | None |
| `reward_breakdown` | `Optional[RewardBreakdown]` | None |
| `judge_notes` | `Optional[str]` | None |
| `verdict` | `Optional[str]` | None |

### `StepResult(BaseModel)`
| Field | Type | Default |
|-------|------|---------|
| `observation` | `Optional[Observation]` | None |
| `reward` | `float` | 0.0 |
| `done` | `bool` | False |
| `info` | `StepInfo` | StepInfo() |

## Episode Models

### `EpisodeState(BaseModel)` — MOD 04
Full internal state for debugging and replay.

| Field | Type | Default |
|-------|------|---------|
| `seed` | `int` | 0 |
| `scenario_template` | `str` | "" |
| `difficulty` | `str` | "easy" |
| `paper_title` | `str` | "" |
| `paper_hypothesis` | `str` | "" |
| `paper_method` | `str` | "" |
| `paper_key_finding` | `str` | "" |
| `experiment_goal` | `str` | "" |
| `lab_budget_total` | `float` | 0.0 |
| `lab_budget_remaining` | `float` | 0.0 |
| `lab_equipment` | `list[str]` | [] |
| `lab_reagents` | `list[str]` | [] |
| `lab_staff_count` | `int` | 0 |
| `lab_time_limit_days` | `int` | 0 |
| `current_protocol` | `Optional[Protocol]` | None |
| `conversation_history` | `list[ConversationEntry]` | [] |
| `round_number` | `int` | 0 |
| `max_rounds` | `int` | 0 |
| `done` | `bool` | False |
| `agreement_reached` | `bool` | False |
| `reward` | `float` | 0.0 |
| `rigor_score` | `float` | 0.0 |
| `feasibility_score` | `float` | 0.0 |
| `fidelity_score` | `float` | 0.0 |

### `EpisodeLog(BaseModel)` — MOD 04
Completed episode record for logging, replay, evaluation.

| Field | Type | Default |
|-------|------|---------|
| `episode_id` | `str` | "" |
| `seed` | `int` | 0 |
| `scenario_template` | `str` | "" |
| `difficulty` | `str` | "easy" |
| `final_state` | `Optional[EpisodeState]` | None |
| `transcript` | `list[ConversationEntry]` | [] |
| `reward_breakdown` | `Optional[RewardBreakdown]` | None |
| `total_reward` | `float` | 0.0 |
| `rounds_used` | `int` | 0 |
| `agreement_reached` | `bool` | False |
| `judge_notes` | `str` | "" |
| `verdict` | `str` | "" |

## Helper Functions

| Function | Purpose |
|----------|---------|
| `_normalize_string_list(value)` | Strip whitespace, reject empty strings |
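One plausible reading of the `_normalize_string_list` behavior is sketched below. This is an illustrative stand-in, not the real helper: the actual implementation in `replicalab/models.py` may handle empty entries differently (e.g. via a Pydantic field validator).

```python
def normalize_string_list(values: list[str]) -> list[str]:
    """Strip whitespace from each entry; reject entries that end up empty."""
    cleaned = [v.strip() for v in values]
    if any(not v for v in cleaned):
        raise ValueError("list entries must be non-empty strings")
    return cleaned
```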
docs/map/scenarios.md
ADDED
@@ -0,0 +1,153 @@

# Scenarios Map — `replicalab/scenarios/`

> Normalized scenario generation across 3 domains with seeded determinism.
>
> **Tasks implemented:** SCN 01-12

## Entry Point

### `generate_scenario(seed, template, difficulty) -> NormalizedScenarioPack`
Located in `templates.py`. The main public API.

**Flow:**
1. `seed_rng(seed)` → deterministic `random.Random` instance
2. `load_template(template)` → picks the template builder function
3. `builder(rng)` → raw draft dict (randomly selects one of 2 cases per domain)
4. `apply_difficulty(draft, difficulty, rng)` → scales budget, time, staff, resources
5. `_build_pack(seed, template, draft)` → constructs `NormalizedScenarioPack`

### `available_scenario_families() -> list[dict]`
Returns `[{"family": name, "difficulties": ["easy", "medium", "hard"]}]` for each template.

## Core Data Classes (all in `templates.py`)

### `NormalizedScenarioPack(BaseModel)` — `extra="forbid"`
The complete scenario definition. Every downstream consumer uses this.

| Field | Type | Source |
|-------|------|--------|
| `scenario_id` | `str` | `"{template}_{seed}"` |
| `template` | `TemplateName` | input param |
| `domain_id` | `str` | from template case |
| `difficulty` | `Difficulty` | input param |
| `seed` | `int` | input param |
| `task_summary` | `str` | from template case |
| `success_criteria` | `list[str]` | from template case |
| `constraints` | `list[ScenarioConstraint]` | from template + difficulty scaling |
| `resources` | `list[ScenarioResource]` | from template + difficulty scaling |
| `allowed_substitutions` | `list[AllowedSubstitution]` | from template case |
| `hidden_reference_spec` | `HiddenReferenceSpec` | from template case |
| `scientist_observation` | `ScientistObservation` | built from case fields |
| `lab_manager_observation` | `LabManagerObservation` | built from case fields |

### `ScenarioConstraint(BaseModel)`
| Field | Type | Example |
|-------|------|---------|
| `key` | `str` | `"gpu_hours"` |
| `label` | `str` | `"Maximum GPU budget"` |
| `quantity` | `float \| int \| None` | `8` |
| `unit` | `str \| None` | `"gpu_hours"` |
| `comparator` | `Literal["<=", ">=", "="]` | `"<="` |
| `hard` | `bool` | `True` |
| `details` | `str` | `"The full run must fit within eight GPU-hours."` |

### `ScenarioResource(BaseModel)`
| Field | Type | Example |
|-------|------|---------|
| `key` | `str` | `"gpu_node"` |
| `label` | `str` | `"A100 GPU node"` |
| `quantity` | `float \| int \| None` | `1` |
| `unit` | `str \| None` | `"node"` |
| `available` | `bool` | `True` |
| `category` | `str` | `"compute"` |
| `details` | `str` | `"Reserved for one benchmark run at a time."` |

### `AllowedSubstitution(BaseModel)`
| Field | Type | Example |
|-------|------|---------|
| `original` | `str` | `"A100 GPU node"` |
| `alternative` | `str` | `"V100 GPU node"` |
| `condition` | `str` | `"Use if A100 is booked."` |
| `tradeoff` | `str` | `"V100 is slower; extend training by ~30%."` |

### `HiddenReferenceSpec(BaseModel)`
Ground truth the judge uses to score fidelity. The scientist never sees this.

| Field | Type | Example |
|-------|------|---------|
| `summary` | `str` | `"A valid plan keeps the published split..."` |
| `required_elements` | `list[str]` | `["published data split", "held-out accuracy evaluation"]` |
| `flexible_elements` | `list[str]` | `["batch size", "learning-rate schedule"]` |
| `target_metric` | `str` | `"held_out_accuracy"` |
| `target_value` | `str` | `"within one point of the reported baseline"` |

## Template Builders

Each returns a raw `dict[str, Any]` with one randomly selected case.

### `build_math_reasoning_template(rng)` — `math_reasoning.py`
- **Domain:** `mathematics`
- **Case A:** Cauchy-Schwarz inequality — structured proof verification
- **Case B:** Jensen's inequality — convexity-based proof
- **Equipment:** Structured proof notebook, Automated proof checker
- **Reagents:** Graduate reviewer, Reference textbook
- **Substitutions:** Graduate reviewer → self-check rubric

### `build_ml_benchmark_template(rng)` — `ml_benchmark.py`
- **Domain:** `machine_learning`
- **Case A:** AG News TinyBERT — text classification replication
- **Case B:** CIFAR-10 ResNet-18 — image classification replication
- **Equipment:** A100 GPU node, Dataset mirror, Experiment tracker
- **Reagents:** Pre-trained checkpoint, Evaluation harness
- **Substitutions:** A100 → V100 (slower), full dataset → stratified sample

### `build_finance_trading_template(rng)` — `finance_trading.py`
- **Domain:** `finance_trading`
- **Case A:** SPY/QQQ mean-reversion — pairs trading backtest
- **Case B:** Momentum futures — trend-following strategy
- **Equipment:** Backtest engine, Historical daily bar dataset
- **Reagents:** Risk reviewer, Compliance packet
- **Substitutions:** Daily bars → weekly bars, risk reviewer → automated risk check
- **Safety restrictions:** offline-only execution policy

## Difficulty Scaling — `apply_difficulty(draft, difficulty, rng)`

| Parameter | Easy | Medium | Hard |
|-----------|------|--------|------|
| `budget_total` | ×1.15 | ×0.95 | ×0.80 |
| `time_limit_days` | unchanged | −1 day | −1 day |
| `staff_count` | unchanged | unchanged | −1 person |
| Resources tightened | 0 | 1 | 2 |
| Conflict constraint | no | yes (1) | yes (1) |

**`_tighten_one_resource`**: picks a random resource, sets `available=False`.
**`_append_conflict_constraint`**: adds a soft constraint noting resource conflict.
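The numeric scaling rules in the table can be sketched as a small pure function. This is a hypothetical illustration: resource tightening and conflict constraints (the parts that consume the RNG) are omitted, and the real logic is `apply_difficulty` in `replicalab/scenarios/templates.py`.

```python
# Budget multipliers from the difficulty table above.
BUDGET_MULTIPLIER = {"easy": 1.15, "medium": 0.95, "hard": 0.80}


def scale_draft(draft: dict, difficulty: str) -> dict:
    """Apply the budget/time/staff scaling rules to a copy of the draft."""
    scaled = dict(draft)  # leave the caller's draft untouched
    scaled["budget_total"] = round(
        scaled["budget_total"] * BUDGET_MULTIPLIER[difficulty], 2
    )
    if difficulty in ("medium", "hard"):
        scaled["time_limit_days"] -= 1  # medium and hard lose one day
    if difficulty == "hard":
        scaled["staff_count"] -= 1  # hard also loses one person
    return scaled
```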
## Utility — `replicalab/utils/seed.py`

| Function | Purpose |
|----------|---------|
| `get_deterministic_seed(seed, namespace)` | SHA256-based child seed derivation |
| `seed_rng(seed, namespace)` | Returns `random.Random(derived_seed)` |
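SHA256-based child seed derivation can be sketched as below. The exact hashing scheme (separator, digest width) is an assumption; the real implementation lives in `replicalab/utils/seed.py`. The point is that each `(seed, namespace)` pair maps to an independent but fully reproducible child seed.

```python
import hashlib
import random


def get_deterministic_seed(seed: int, namespace: str = "") -> int:
    # Hash the (seed, namespace) pair so each namespace gets its own
    # reproducible child seed, independent of other namespaces.
    digest = hashlib.sha256(f"{seed}:{namespace}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def seed_rng(seed: int, namespace: str = "") -> random.Random:
    # A dedicated Random instance avoids global random-state coupling.
    return random.Random(get_deterministic_seed(seed, namespace))
```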
## Type Aliases

```python
Difficulty = Literal["easy", "medium", "hard"]
TemplateName = Literal["math_reasoning", "ml_benchmark", "finance_trading"]
TemplateBuilder = Callable[[Any], dict[str, Any]]
```

## Constants

```python
GOLDEN_SCENARIO_SPECS_PATH = Path("tests/fixtures/golden_scenarios.json")
```

## Who Consumes This

- **`validation.py`** — reads constraints, resources, substitutions, hidden_reference_spec
- **`lab_manager_policy.py`** — reads lab_manager_observation, substitutions, constraints
- **`scientist_policy.py`** — reads scenario pack for system prompt generation
- **`server/app.py`** — calls `generate_scenario()` on reset, stores pack for lab manager
- **`scoring/`** (future) — will read hidden_reference_spec for fidelity scoring
docs/map/scoring.md
ADDED
@@ -0,0 +1,250 @@
# Scoring Map — `replicalab/scoring/`

> Judge scoring engine for protocol evaluation.
> Pure deterministic functions — no model calls, no side effects.
>
> **Tasks implemented:** JDG 01, JDG 02, JDG 03, JDG 04, JDG 05, JDG 06, JDG 08
> **Tasks remaining:** JDG 07

## Oracle Hybrid Note

The repo now includes an additive Oracle layer for richer scenario generation,
optional Lab Manager narration, optional event injection, and post-mortem
analysis. None of that replaces the files in `replicalab/scoring/`.

For RL training, this folder remains the canonical reward source:
- deterministic
- reproducible
- testable
- used by the environment for the actual scalar reward signal

## Architecture

```
replicalab/scoring/
    __init__.py      # exports: score_rigor, score_feasibility, score_fidelity,
                     #          build_reward_breakdown, compute_total_reward
    rigor.py         # JDG 01 — protocol structural quality
    feasibility.py   # JDG 02 — resource feasibility (wraps AGT 05)
    fidelity.py      # JDG 03 — adherence to hidden reference spec
    rubric.py        # JDG 04-05 — total reward formula and breakdown builder
    explain.py       # JDG 06 — deterministic plain-English explanation
```

## Current Reward Structure

The training signal now has two layers:

- **Terminal reward** from `replicalab/scoring/rubric.py`
  - `10 * rigor * feasibility * fidelity * parsimony`
  - plus bonuses
  - minus named penalties
- **Step shaping reward** from `replicalab/env/replicalab_env.py`
  - information-gain bonus for novel questions
  - protocol-delta and momentum bonuses for productive revisions
  - contradiction, hallucination, stalling, regression, invalid-action, timeout, and no-agreement penalties

The judge remains deterministic. The terminal audit still explains the final
`RewardBreakdown`, while cumulative episode reward now includes the per-step
shaping applied inside the environment.
## Shared Utilities

Token matching is extracted into `replicalab/utils/text.py`:
- `normalize_label(label) -> str` — lowercase, strip, collapse whitespace
- `element_tokens(element) -> list[str]` — split into searchable tokens (3+ chars)

Used by: `validation.py`, `rigor.py`, `fidelity.py`
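A minimal sketch of the two helpers, assuming straightforward regex-based splitting; the real `replicalab/utils/text.py` may differ in detail:

```python
import re

def normalize_label(label: str) -> str:
    # Lowercase, strip, and collapse internal whitespace to single spaces.
    return re.sub(r"\s+", " ", label.strip().lower())

def element_tokens(element: str) -> list[str]:
    # Split a normalized label into searchable tokens of 3+ characters.
    return [tok for tok in re.split(r"\W+", normalize_label(element)) if len(tok) >= 3]

tokens = element_tokens("  Western   Blot (anti-GFP)  ")
```

Short connective words drop out via the length filter, so matching against a protocol text blob keys on the substantive terms.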
---

## JDG 01 — `score_rigor(protocol, scenario) -> float`

**File:** `rigor.py`
**Range:** [0.0, 1.0]
**Measures:** structural completeness and alignment to scenario requirements.

### Weight Breakdown

| Component | Weight | Method |
|-----------|--------|--------|
| Structural completeness | 0.30 | Field population checks |
| Success criteria coverage | 0.40 | Token match vs `scenario.success_criteria` |
| Required element coverage | 0.30 | Token match vs `hidden_reference_spec.required_elements` |

### Structural Completeness (0.30)

Each of the seven checks contributes equally (0.05 each, summing to 0.35, then normalized to the 0.30 weight):

| Check | Condition |
|-------|-----------|
| Sample size present | `sample_size >= 1` |
| Sample size meaningful | `sample_size >= 4` |
| Has control | `len(controls) >= 1` |
| Has second control | `len(controls) >= 2` |
| Technique specified | `technique` non-empty |
| Duration allocated | `duration_days >= 1` |
| Substantive rationale | `len(rationale) > 20` |

### Internal Functions

| Function | Purpose |
|----------|---------|
| `_structural_completeness(protocol)` | Field population score |
| `_success_criteria_coverage(protocol, scenario)` | Fraction of criteria matched |
| `_required_element_coverage(protocol, scenario)` | Fraction of elements matched |
| `_protocol_text_blob(protocol)` | Join all text fields for matching |
| `_text_matches(element, blob)` | Token overlap check |

---
## JDG 02 — `score_feasibility(protocol, scenario, check=None) -> float`

**File:** `feasibility.py`
**Range:** [0.0, 1.0]
**Measures:** whether the lab can execute this protocol.

### Key Design: No Rescoring

Does NOT recompute feasibility from scratch. Derives score from the `FeasibilityCheckResult`
produced by AGT 05's `check_feasibility()`. If no pre-computed check is passed, calls
`check_feasibility()` internally. This prevents drift between Lab Manager grounding
and Judge scoring.

### Weight Breakdown

7 dimensions, each worth 1/7:

| Dimension | Type | Partial Credit Formula |
|-----------|------|----------------------|
| Protocol | Binary | 1.0 if ok, else 0.0 |
| Budget | Continuous | `min(1.0, budget_remaining / estimated_cost)` |
| Equipment | Continuous | fraction of required items that are available |
| Reagents | Continuous | fraction of required items that are in stock |
| Schedule | Binary | 1.0 if ok, else 0.0 |
| Staff | Continuous | `min(1.0, staff_count / required_staff)` |
| Policy | Binary | 1.0 if ok, else 0.0 (hard constraint) |

### Internal Functions

| Function | Purpose |
|----------|---------|
| `_budget_score(check, budget_remaining)` | Continuous budget ratio |
| `_staff_score(check, staff_count)` | Continuous staff ratio |
| `_fraction_score(required, available)` | Generic item-availability fraction |
---
## JDG 03 — `score_fidelity(protocol, scenario) -> float`

**File:** `fidelity.py`
**Range:** [0.0, 1.0]
**Measures:** adherence to `hidden_reference_spec` (which the scientist never sees).

### Weight Breakdown

| Component | Weight | Method |
|-----------|--------|--------|
| Required element coverage | 0.50 | Substitution-aware token match |
| Flexible element alignment | 0.20 | Bonus only, no penalty |
| Target metric alignment | 0.20 | Token match vs metric + value |
| Technique appropriateness | 0.10 | Token match vs spec summary |

### Substitution-Aware Scoring

For required elements:
- **Direct match** (token in protocol text): 1.0 credit
- **Substitution match** (allowed alternative present): 0.7 credit
- **Miss**: 0.0 credit

This is the key difference from JDG 01's element check.
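The three-tier credit can be sketched as below, using simple substring matching for brevity (the real scorer matches via `element_tokens`):

```python
def required_element_score(elements, protocol_text, sub_index):
    # 1.0 for a direct match, 0.7 for an allowed substitution, 0.0 for a miss.
    if not elements:
        return 1.0
    text = protocol_text.lower()
    total = 0.0
    for element in elements:
        if element.lower() in text:
            total += 1.0
        elif any(alt.lower() in text for alt in sub_index.get(element, [])):
            total += 0.7
    return total / len(elements)

# "flow cytometry" is missing but its allowed alternative "FACS" is present,
# so it earns 0.7; "negative control" matches directly for 1.0.
score = required_element_score(
    ["flow cytometry", "negative control"],
    "We will run FACS analysis with a negative control arm.",
    {"flow cytometry": ["FACS"]},
)
```

This ordering (direct > substitution > miss) is what `test_fidelity_direct_match_beats_substitution_and_miss` verifies.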
### Internal Functions

| Function | Purpose |
|----------|---------|
| `_required_element_score(elements, text, sub_index)` | Substitution-aware coverage |
| `_flexible_element_score(elements, text)` | Bonus-only coverage |
| `_target_metric_score(metric, value, text)` | Metric + value matching |
| `_technique_score(summary, text)` | Summary alignment |
| `_protocol_text_blob(protocol)` | Join text fields |
| `_text_matches(element, blob)` | Token overlap |
| `_substitution_matches(element, text, sub_index)` | Check alternatives |
| `_build_substitution_index(scenario)` | Map originals → alternatives |
---
## JDG 04 — `compute_total_reward(breakdown) -> float`

**File:** `rubric.py`
**Formula:** `10 × rigor × feasibility × fidelity + efficiency_bonus + communication_bonus − sum(penalties)`

Returns a scalar reward from a `RewardBreakdown` object.
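A worked example of the formula. Note the "Current Reward Structure" section above also folds a parsimony factor into the core; this sketch follows the JDG 04 formula as written here, and the keyword-argument signature is illustrative (the real function takes a `RewardBreakdown`):

```python
def compute_total_reward(rigor, feasibility, fidelity,
                         efficiency_bonus=0.0, communication_bonus=0.0,
                         penalties=()):
    # Multiplicative core scaled to 10, plus bonuses, minus named penalties.
    core = 10 * rigor * feasibility * fidelity
    return core + efficiency_bonus + communication_bonus - sum(penalties)

# rigor 0.8, feasibility 0.9, fidelity 0.75 gives a core of 5.4;
# an efficiency bonus of 0.6 brings the total to 6.0.
reward = compute_total_reward(0.8, 0.9, 0.75, efficiency_bonus=0.6)
```

The multiplicative core means a zero on any one axis zeroes the whole core, so a protocol cannot buy back a hard policy violation with rigor alone.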
## JDG 05 — `build_reward_breakdown(protocol, scenario, rounds_used, max_rounds, *, check=None) -> RewardBreakdown`

**File:** `rubric.py`
**Composes:** rigor (JDG 01) + feasibility (JDG 02) + fidelity (JDG 03) + efficiency bonus.

### Efficiency Bonus
- Max bonus: 1.0 (configurable via `_MAX_EFFICIENCY_BONUS`)
- Formula: `bonus × (max_rounds - rounds_used) / (max_rounds - 1)`
- Finishing in round 1 of 6 → maximum bonus; using all rounds → 0
### Internal Functions

| Function | Purpose |
|----------|---------|
| `compute_total_reward(breakdown)` | Apply the reward formula |
| `build_reward_breakdown(...)` | Compose all sub-scores into a breakdown |
| `_efficiency_bonus(rounds_used, max_rounds)` | Compute efficiency bonus |

---
## Not Yet Implemented

### Bonuses & Penalties — JDG 07
- `communication_bonus`: reward for clear negotiation (reserved)
- `penalties`: policy violations, hallucinated resources, etc.

## Data Consumed

| Source | Used by | For what |
|--------|---------|----------|
| `Protocol` (models.py) | All 3 scorers | The final agreed protocol |
| `NormalizedScenarioPack` (scenarios) | All 3 scorers | Constraints, resources, criteria |
| `HiddenReferenceSpec` (scenarios) | JDG 01, JDG 03 | Required/flexible elements, target metric |
| `FeasibilityCheckResult` (agents) | JDG 02 | 7 dimension checks with partial credit |
| `AllowedSubstitution` (scenarios) | JDG 03 | Partial credit for substitutions |
| `element_tokens` (utils/text.py) | JDG 01, JDG 03 | Shared token extraction |

## Test Coverage — `tests/test_reward.py`

| Test | What it verifies |
|------|-----------------|
| `test_rigor_good_protocol_scores_higher_than_bad` | Quality ordering |
| `test_rigor_is_deterministic` | Same inputs → same output |
| `test_rigor_empty_controls_reduces_score` | Controls matter |
| `test_rigor_short_rationale_reduces_score` | Rationale length matters |
| `test_rigor_all_domains_return_valid_range` | [0,1] across all 9 combinations |
| `test_feasibility_viable_protocol_scores_high` | Good protocol > 0.7 |
| `test_feasibility_infeasible_protocol_scores_lower` | Bad < good |
| `test_feasibility_accepts_precomputed_check` | Pre-computed = computed |
| `test_feasibility_is_deterministic` | Same inputs → same output |
| `test_feasibility_partial_credit_for_near_budget` | Slightly over > far over |
| `test_feasibility_all_domains_return_valid_range` | [0,1] across all 9 combinations |
| `test_fidelity_aligned_protocol_scores_higher` | Aligned > misaligned |
| `test_fidelity_is_deterministic` | Same inputs → same output |
| `test_fidelity_substitution_gets_partial_credit` | Sub > miss |
| `test_fidelity_mentioning_target_metric_improves_score` | Metric mention helps |
| `test_fidelity_all_domains_return_valid_range` | [0,1] across all 9 combinations |
| `test_all_scores_between_zero_and_one_for_bad_protocol` | Bounds check |
| `test_good_protocol_dominates_bad_on_rigor_and_fidelity` | Cross-scorer consistency |
| `test_good_protocol_beats_awful_protocol_on_all_scores_and_total_reward` | Good protocol beats a clearly infeasible protocol across all judge axes |
| `test_rigor_explicit_success_criteria_mentions_improve_score` | Success-criteria mentions improve rigor coverage |
| `test_feasibility_partial_equipment_credit_sits_between_full_and_total_miss` | Partial equipment availability yields intermediate feasibility credit |
| `test_fidelity_direct_match_beats_substitution_and_miss` | Fidelity prefers direct match over allowed substitution over a miss |
| `test_breakdown_matches_with_and_without_precomputed_feasibility_check` | Reward breakdown stays identical with or without an injected feasibility check |
docs/map/server.md
ADDED
@@ -0,0 +1,80 @@
# Server Map - `server/app.py`

> FastAPI backend with REST + WebSocket endpoints. The normal path now uses
> the real `ReplicaLabEnv`; `_StubEnv` remains only as a fallback if the env
> package cannot be imported.
>
> **Tasks implemented:** API 01-09, 13, 15

## Environment path

### `ReplicaLabEnv`
Primary environment implementation imported from
`replicalab.env.replicalab_env`.

### `_StubEnv`
Legacy fallback kept so the server can still boot if the real env import
fails. It is no longer the intended local or Docker runtime.

### `_make_env()`
Factory that prefers `ReplicaLabEnv` and falls back to `_StubEnv` only on
import failure.
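A sketch of the fallback pattern. The stub body here is illustrative only; the real `_StubEnv` mimics the env interface:

```python
class _StubEnv:
    """Illustrative placeholder for the legacy fallback environment."""

def _make_env():
    # Prefer the real environment; fall back only if the import fails.
    try:
        from replicalab.env.replicalab_env import ReplicaLabEnv
    except ImportError:
        return _StubEnv()
    return ReplicaLabEnv()

env = _make_env()
```

Catching only `ImportError` (not a bare `except`) means genuine bugs inside the real env still surface at startup instead of being silently masked by the stub.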
## REST endpoints

### `GET /health`
Returns a liveness payload. When the real env path is active, the response
includes `env: "real"`.

### `POST /reset`
Starts a new episode and returns:

- `session_id`
- `episode_id`
- typed `Observation`

### `POST /step`
Submits a typed `ScientistAction` and returns `StepResult`.

When `done=true`, the terminal `StepResult` is also used to build the replay
log so `reward_breakdown`, `judge_notes`, and `verdict` stay aligned with the
real env result.

### `GET /scenarios`
Returns the available scenario families and supported difficulties.

### `GET /replay/{episode_id}`
Returns the stored `EpisodeLog` for a completed episode, or 404 if not found.

## WebSocket endpoint

### `WS /ws`
Per-connection isolated environment session supporting:

- `reset`
- `step`
- `ping`

Idle timeout and disconnect cleanup are implemented and verified.
## Session management

| Store | Purpose |
| --- | --- |
| `_sessions` | Active REST sessions |
| `_replay_store` | Completed episode logs |

## Key helpers

| Function | Purpose |
| --- | --- |
| `_build_episode_log(episode_id, state, result)` | Build replay log from final state and terminal step result |
| `_touch(session_id)` | Refresh REST session last-active timestamp |
| `_cleanup_stale_sessions()` | Remove expired REST sessions |

## Current deployment state

- Local OpenEnv validation passes
- Local Docker build and run verification passes
- HF Spaces metadata is present in the root `README.md` and root `Dockerfile`
- Live hosted verification remains `API 10`
docs/map/tests.md
ADDED
@@ -0,0 +1,62 @@
# Tests Map - `tests/`

> 365 tests across 18 files. All passing.
>
> **Last verified:** 2026-03-08

## Summary

| File | Tests | What it covers |
|------|-------|----------------|
| `test_api_rest_isolation.py` | 11 | `API 14` REST session isolation and replay separation |
| `test_cache.py` | 2 | Oracle scenario caching and reuse |
| `test_client.py` | 24 | `TRN 13` reusable client over REST and WebSocket |
| `test_config.py` | 3 | Shared constants and config consistency |
| `test_env.py` | 56 | `ENV 01-08`, `ENV 10`, `ENV 11`, `OBS 04`, `JDG 04-05`, `TST 01-03` |
| `test_judge_policy.py` | 10 | `JDG 11` structured judge audit payload |
| `test_lab_manager_policy.py` | 37 | `AGT 05-07` plus `AGT 09` determinism coverage |
| `test_models.py` | 21 | Action, observation, step, state, and log contracts |
| `test_logging.py` | 11 | `MOD 07` replay persistence and `JDG 07` CSV logging helpers |
| `test_oracle.py` | 5 | Oracle hybrid wrapper, structured parsing, and env reset adapter |
| `test_prompts.py` | 7 | `AGT 10` prompt files and Oracle prompt asset loading |
| `test_reward.py` | 40 | `JDG 01-06`, `JDG 08`, and reward regression coverage |
| `test_rollout.py` | 12 | `TRN 03` rollout worker behavior |
| `test_rollout_traces.py` | 2 | `TRN 04` bounded tool trace aggregation and batched collection |
| `test_scenarios.py` | 14 | `SCN 01-13` scenario generation, determinism, and Oracle scenario adaptation |
| `test_scientist_policy.py` | 46 | `MOD 09`, `AGT 01-04`, `AGT 08` |
| `test_server.py` | 44 | `API 01-04`, `API 06-08`, `API 13-14`, replay audit propagation, and root landing page |
| `test_validation.py` | 20 | `MOD 05-06` semantic validation |
| **Total** | **365** | |
## Coverage Notes

- The environment stack is covered end to end:
  - `test_env.py` validates reset, step, invalid action, termination, reward integration, deep state snapshots, close/reopen lifecycle behavior, terminal judge-audit propagation, and seeded replay determinism across all scenario families.
- The API/server stack is covered end to end:
  - `test_server.py` covers REST reset/step/scenarios, WebSocket session handling, idle timeout cleanup, CORS behavior, and replay audit propagation.
- The scientist stack is covered end to end:
  - `test_scientist_policy.py`, `test_prompts.py`, `test_rollout.py`, and `test_rollout_traces.py` together cover prompt construction, observation formatting, parse/retry, baseline policy, rollout collection, and bounded tool trace capture.
- The judge stack is covered end to end:
  - `test_reward.py` covers rubric scores and reward math, while `test_judge_policy.py` covers structured audit payload generation.
- The Oracle hybrid layer is covered additively:
  - `test_oracle.py`, `test_cache.py`, and `test_prompts.py` cover Oracle scenario generation wrappers, cache reuse, and prompt asset loading without changing the deterministic reward contract.
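The recurring `*_is_deterministic` tests all follow the same pattern: call a pure scorer twice with identical inputs and require identical outputs. A generic sketch, with a hypothetical stand-in scorer (names not taken from the suite):

```python
def score_stub(protocol: dict) -> float:
    # Stand-in for score_rigor / score_feasibility / score_fidelity:
    # any pure function of its inputs passes this pattern.
    controls = protocol.get("controls", [])
    sample_size = protocol.get("sample_size", 0)
    return 0.1 * len(controls) + 0.05 * min(sample_size, 8)

def test_scorer_is_deterministic():
    protocol = {"technique": "qPCR", "sample_size": 8, "controls": ["vehicle"]}
    assert score_stub(protocol) == score_stub(protocol)

test_scorer_is_deterministic()
```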
## Remaining Gaps

| Planned test work | Why it still matters |
|-------------------|----------------------|
| `TST 09` notebook smoke coverage | Fresh-runtime validation for the judged training notebook |

## Task-to-Test Mapping

| Area | Primary test files |
|------|--------------------|
| Models and contracts | `test_models.py`, `test_validation.py` |
| Scenarios | `test_scenarios.py` |
| Oracle integration and cache | `test_oracle.py`, `test_cache.py`, `test_prompts.py` |
| Scientist policy | `test_scientist_policy.py`, `test_prompts.py` |
| Lab Manager policy | `test_lab_manager_policy.py` |
| Judge and reward | `test_reward.py`, `test_judge_policy.py` |
| Environment | `test_env.py` |
| API and deployment-facing server behavior | `test_server.py` |
| Client and training rollouts | `test_client.py`, `test_rollout.py`, `test_rollout_traces.py` |