Add model selection and architecture notes
- Qwen3-4B as primary trainable Scientist, Qwen3-8B as H100 stretch
- Deterministic rubric remains sole training reward
- Hosted frontier evaluator for optional explanation and demo audit only
- Lab Manager stays rule-based for MVP
- Future model-backed Lab Manager added to stretch backlog and risk register
- Updated Person B docs with base-model rationale and reward notes
- ReplicaLab_Comprehensive_Task_Division.md +32 -5
- docs/ayush/task_breakdown.md +56 -17
- docs/ayush/task_list.md +9 -2
ReplicaLab_Comprehensive_Task_Division.md
CHANGED
|
@@ -96,13 +96,35 @@ By judging time, the project should demonstrate:
|
|
| 96 |
| Storytelling | everyone contributes screenshots, gifs, examples |
|
| 97 |
| Submission readiness | all four review final demo, notebook, README, repo visibility |
|
| 98 |
|
| 99 |
-
## 4.1 Training compute
|
| 100 |
|
| 101 |
1. The team has access to an H100 GPU for heavier Scientist training and evaluation runs.
|
| 102 |
2. Person B is the primary owner of that compute for RL tasks, especially `TRN 04` to `TRN 10`, `TRN 13` to `TRN 15`, `OBS 06`, and `TST 09`.
|
| 103 |
3. The judged artifact remains the Colab notebook, so any H100 run must still have a documented notebook path or reduced scale fallback that can be shown in Colab.
|
| 104 |
4. Person C supports any environment URL, secret, or infra setup needed so the H100 training run can connect to the same backend contract as the notebook.
|
| 106 |
---
|
| 107 |
|
| 108 |
## 5. Module and function ownership map
|
|
@@ -209,12 +231,16 @@ Create a stable shared codebase, contracts, and development workflow so all work
|
|
| 209 |
|
| 210 |
- `FND 01` status: completed on 2026-03-07
|
| 211 |
- `FND 01` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
|
| 212 |
- `FND 10` status: completed on 2026-03-07
|
| 213 |
- `FND 10` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
|
| 214 |
- Completed scope for `FND 01`: created the agreed repo scaffold for `replicalab/`, `server/`, `frontend/`, `notebooks/`, and `tests/`, including the initial `replicalab/*` and `frontend/src/*` subfolders from the planned layout
|
| 215 |
- Completed scope for `FND 10`: created `replicalab/outputs/` with tracked `logs/`, `replays/`, and `plots/` subdirectories
|
| 216 |
-
- Remaining work now unblocked by `FND 01`: `FND 02`, `FND 03`, `FND
|
| 217 |
-
-
|
| 218 |
|
| 219 |
### User stories
|
| 220 |
|
|
@@ -231,7 +257,7 @@ As a team, we want agreed schemas and coding rules so integration risk stays low
|
|
| 231 |
| FND 01 | E01.1 | Person C | repo root | Create repo structure and base folders from agreed layout | none | 0.5h | all top level folders exist and repo clones cleanly | ✅ Completed | Person B (Ayush) |
|
| 232 |
| FND 02 | E01.1 | Person C | `pyproject.toml` | Add Python project config and dependencies placeholder | FND 01 | 0.5h | project installs locally without missing package errors for base modules | ⬜ Not started | — |
|
| 233 |
| FND 03 | E01.1 | Person C | `frontend/package.json` | Initialize React plus Vite frontend shell | FND 01 | 0.5h | `npm install` and dev server run successfully | ⬜ Not started | — |
|
| 234 |
-
| FND 04 | E01.2 | Person A | `replicalab/models.py` | Add empty Pydantic models and shared type names | FND 01 | 0.5h | import paths resolve for all placeholder models | ⬜ Not started | — |
|
| 235 |
| FND 05 | E01.2 | Person C | `.gitignore` and `.dockerignore` | Add ignore rules for Python, Node, logs, notebooks, and build artifacts. `.dockerignore` must explicitly exclude `.git`, `node_modules`, `notebooks/`, `tests/`, `__pycache__`, `.venv`, and output files to keep the Docker image lean | FND 01 | 0.25h | repo status stays clean after local run and build, and Docker build excludes non-runtime files | ⬜ Not started | — |
|
| 236 |
| FND 06 | E01.2 | Person D | `README.md` | Add temporary project stub with title, mission, team roles, and local setup placeholder | FND 01 | 0.5h | new contributor can understand repo purpose in under two minutes | ⬜ Not started | — |
|
| 237 |
| FND 07 | E01.2 | Person C | repo settings | Define branch naming, PR template, and issue template | FND 01 | 0.5h | all future PRs auto show the template and issue fields | ⬜ Not started | — |
|
|
@@ -707,7 +733,7 @@ The MVP is complete when all of the following are true:
|
|
| 707 |
| 2 | add judge plain English explanation panel | better judge readability |
|
| 708 |
| 3 | add second and third difficulty levels to all templates | stronger world modeling story |
|
| 709 |
| 4 | add curriculum training path | stronger self improvement story |
|
| 710 |
-
| 5 | add
|
| 711 |
| 6 | add third agent such as ethics reviewer | potential partner fit extension |
|
| 712 |
| 7 | add post episode self critique before retry | stronger self improvement story from Blueprint Section 14.2 |
|
| 713 |
| 8 | add automatic scenario difficulty scaling | adaptive curriculum from Blueprint Section 14.2 |
|
|
@@ -725,6 +751,7 @@ The MVP is complete when all of the following are true:
|
|
| 725 |
| reward too noisy or subjective | high | Person A | keep judge deterministic and rubric based |
|
| 726 |
| final demo breaks live | high | all | keep replay logs and a pre tested demo seed ready |
|
| 727 |
| too many scenarios | medium | Person A | ship one excellent scenario, then add more only if stable |
|
| 728 |
|
| 729 |
---
|
| 730 |
|
| 96 |
| Storytelling | everyone contributes screenshots, gifs, examples |
|
| 97 |
| Submission readiness | all four review final demo, notebook, README, repo visibility |
|
| 98 |
|
| 99 |
+
## 4.1 Training compute and model selection
|
| 100 |
|
| 101 |
1. The team has access to an H100 GPU for heavier Scientist training and evaluation runs.
|
| 102 |
2. Person B is the primary owner of that compute for RL tasks, especially `TRN 04` to `TRN 10`, `TRN 13` to `TRN 15`, `OBS 06`, and `TST 09`.
|
| 103 |
3. The judged artifact remains the Colab notebook, so any H100 run must still have a documented notebook path or reduced scale fallback that can be shown in Colab.
|
| 104 |
4. Person C supports any environment URL, secret, or infra setup needed so the H100 training run can connect to the same backend contract as the notebook.
|
| 105 |
|
| 106 |
+
### Trainable model
|
| 107 |
+
|
| 108 |
+
The primary trainable model for the Scientist policy is **Qwen3-4B**.
|
| 109 |
+
|
| 110 |
+
| Model | Role | Rationale |
|
| 111 |
+
| --- | --- | --- |
|
| 112 |
+
| Qwen3-4B | Primary Scientist policy | BF16 fits H100 (~14GB weights, ~42-56GB training). 4-bit fits Colab T4 (5.5GB). Strong structured output for JSON action schemas. Fast RL iteration speed. |
|
| 113 |
+
| Qwen3-8B | H100-only stretch | Better reasoning quality but 4-bit barely fits Colab T4 (6.5GB). Use only if Qwen3-4B quality is insufficient and Colab demo uses reduced-scale fallback. |
|
| 114 |
+
|
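The Colab-vs-H100 split above can be sketched as a small config selector. This is a hedged illustration only: the target keys, the `REPLICALAB_TARGET` environment variable, and the Hugging Face repo ids are assumptions, not frozen project choices.

```python
import os

# Sketch of the runtime split described in the table above. Target keys, the
# REPLICALAB_TARGET variable, and the repo ids are assumptions, not decisions.
MODEL_TARGETS = {
    "colab_t4": {"model": "Qwen/Qwen3-4B", "load_in_4bit": True},   # ~5.5GB
    "h100": {"model": "Qwen/Qwen3-4B", "load_in_4bit": False},      # BF16
    "h100_stretch": {"model": "Qwen/Qwen3-8B", "load_in_4bit": False},
}

def select_model_config(target=None):
    """Pick a model config for the current runtime, defaulting to the Colab path."""
    target = target or os.environ.get("REPLICALAB_TARGET", "colab_t4")
    return MODEL_TARGETS[target]
```

Keeping the Colab path as the default matches the rule that the judged artifact stays the notebook, with H100 targets as explicit opt-ins.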
| 115 |
+
### Evaluator layer
|
| 116 |
+
|
| 117 |
+
The training reward is always the **deterministic rubric engine** defined in E05. A hosted frontier evaluator may optionally be used for post-episode explanation and demo audit only. The frontier evaluator is never part of the training reward loop.
|
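The separation above can be sketched in two functions: the reward path uses rubric components only, and the frontier evaluator (if configured at all) runs after the episode and never feeds back into the reward. The equal-weight combination and function names are illustrative assumptions, not the E05 spec.

```python
def train_step_reward(component_scores: dict) -> float:
    """Deterministic reward path: rubric components only, no model call."""
    values = list(component_scores.values())
    return sum(values) / len(values)  # placeholder equal-weight combination

def maybe_explain(episode_summary: str, explain_fn=None):
    """Optional post-episode audit; explain_fn would wrap the hosted evaluator."""
    if explain_fn is None:
        return None  # training runs unchanged with no evaluator configured
    return explain_fn(episode_summary)
```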
| 118 |
+
|
| 119 |
+
### MVP role implementations
|
| 120 |
+
|
| 121 |
+
| Role | MVP implementation | Future stretch |
|
| 122 |
+
| --- | --- | --- |
|
| 123 |
+
| Scientist | Trainable policy (Qwen3-4B) | Qwen3-8B if quality insufficient |
|
| 124 |
+
| Lab Manager | Rule-based deterministic policy | Model-backed policy using same base model with separate adapter |
|
| 125 |
+
| Judge (training reward) | Deterministic rubric engine | Unchanged |
|
| 126 |
+
| Judge (explanation layer) | Optional hosted frontier evaluator | Extended explanation panel in UI |
|
| 127 |
+
|
| 128 |
---
|
| 129 |
|
| 130 |
## 5. Module and function ownership map
|
| 231 |
|
| 232 |
- `FND 01` status: completed on 2026-03-07
|
| 233 |
- `FND 01` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
|
| 234 |
+
- `FND 04` status: completed on 2026-03-08
|
| 235 |
+
- `FND 04` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
|
| 236 |
- `FND 10` status: completed on 2026-03-07
|
| 237 |
- `FND 10` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
|
| 238 |
- Completed scope for `FND 01`: created the agreed repo scaffold for `replicalab/`, `server/`, `frontend/`, `notebooks/`, and `tests/`, including the initial `replicalab/*` and `frontend/src/*` subfolders from the planned layout
|
| 239 |
+
- Completed scope for `FND 04`: added importable empty Pydantic model stubs in `replicalab/models.py` for the shared action, observation, step, state, and log contracts
|
| 240 |
- Completed scope for `FND 10`: created `replicalab/outputs/` with tracked `logs/`, `replays/`, and `plots/` subdirectories
|
| 241 |
+
- Remaining work now unblocked by `FND 01`: `FND 02`, `FND 03`, `FND 05`, `FND 06`, `FND 07`
|
| 242 |
+
- Newly unblocked by `FND 04`: `FND 08`, `FND 09`
|
| 243 |
+
- Remaining Epic E01 work still gated by follow-on dependencies: `FND 11`, `FND 12`, `FND 13`
|
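The FND 04 deliverable noted above amounts to importable empty stubs. A minimal sketch, assuming Pydantic is installed; only `ScientistAction` is named in the task table, so the other class names are illustrative placeholders until FND 08 freezes the contract.

```python
# replicalab/models.py -- empty, importable contract stubs (FND 04 scope).
# Only ScientistAction is a confirmed name; the rest are placeholders.
from pydantic import BaseModel

class ScientistAction(BaseModel):
    """Fields intentionally empty until the FND 08 contract freeze."""

class Observation(BaseModel):
    """Placeholder observation contract."""

class StepRecord(BaseModel):
    """Placeholder step contract."""

class EpisodeState(BaseModel):
    """Placeholder state contract."""

class EpisodeLog(BaseModel):
    """Placeholder log contract."""
```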
| 244 |
|
| 245 |
### User stories
|
| 246 |
|
| 257 |
| FND 01 | E01.1 | Person C | repo root | Create repo structure and base folders from agreed layout | none | 0.5h | all top level folders exist and repo clones cleanly | ✅ Completed | Person B (Ayush) |
|
| 258 |
| FND 02 | E01.1 | Person C | `pyproject.toml` | Add Python project config and dependencies placeholder | FND 01 | 0.5h | project installs locally without missing package errors for base modules | ⬜ Not started | — |
|
| 259 |
| FND 03 | E01.1 | Person C | `frontend/package.json` | Initialize React plus Vite frontend shell | FND 01 | 0.5h | `npm install` and dev server run successfully | ⬜ Not started | — |
|
| 260 |
+
| FND 04 | E01.2 | Person A | `replicalab/models.py` | Add empty Pydantic models and shared type names | FND 01 | 0.5h | import paths resolve for all placeholder models | ✅ Completed | Person B (Ayush) |
|
| 261 |
| FND 05 | E01.2 | Person C | `.gitignore` and `.dockerignore` | Add ignore rules for Python, Node, logs, notebooks, and build artifacts. `.dockerignore` must explicitly exclude `.git`, `node_modules`, `notebooks/`, `tests/`, `__pycache__`, `.venv`, and output files to keep the Docker image lean | FND 01 | 0.25h | repo status stays clean after local run and build, and Docker build excludes non-runtime files | ⬜ Not started | — |
|
| 262 |
| FND 06 | E01.2 | Person D | `README.md` | Add temporary project stub with title, mission, team roles, and local setup placeholder | FND 01 | 0.5h | new contributor can understand repo purpose in under two minutes | ⬜ Not started | — |
|
| 263 |
| FND 07 | E01.2 | Person C | repo settings | Define branch naming, PR template, and issue template | FND 01 | 0.5h | all future PRs auto show the template and issue fields | ⬜ Not started | — |
|
| 733 |
| 2 | add judge plain English explanation panel | better judge readability |
|
| 734 |
| 3 | add second and third difficulty levels to all templates | stronger world modeling story |
|
| 735 |
| 4 | add curriculum training path | stronger self improvement story |
|
| 736 |
+
| 5 | add model-backed Lab Manager using same base model with a separate role adapter | stronger multi agent depth but higher risk, reward stays deterministic, Lab Manager affects trajectory variance not reward definition |
|
| 737 |
| 6 | add third agent such as ethics reviewer | potential partner fit extension |
|
| 738 |
| 7 | add post episode self critique before retry | stronger self improvement story from Blueprint Section 14.2 |
|
| 739 |
| 8 | add automatic scenario difficulty scaling | adaptive curriculum from Blueprint Section 14.2 |
|
| 751 |
| reward too noisy or subjective | high | Person A | keep judge deterministic and rubric based |
|
| 752 |
| final demo breaks live | high | all | keep replay logs and a pre tested demo seed ready |
|
| 753 |
| too many scenarios | medium | Person A | ship one excellent scenario, then add more only if stable |
|
| 754 |
+
| future model-backed Lab Manager increases episode variance | medium | Person B | keep rule-based Lab Manager for MVP training, introduce model-backed version only after Scientist policy is stable, use same base model with separate adapter to limit infra complexity |
|
| 755 |
|
| 756 |
---
|
| 757 |
|
docs/ayush/task_breakdown.md
CHANGED
|
@@ -9,8 +9,8 @@ No assumptions from other documents are used to reclassify blocked status.
|
|
| 9 |
|
| 10 |
## 1. Blocking Status
|
| 11 |
|
| 12 |
-
Per the source of truth,
|
| 13 |
-
|
| 14 |
|
| 15 |
---
|
| 16 |
|
|
@@ -21,24 +21,22 @@ These tasks are first gated by upstream deliverables, primarily from Person A.
|
|
| 21 |
|
| 22 |
| ID | Task | Depends On | Person A Deliverable | Est |
|
| 23 |
|----|------|-----------|---------------------|-----|
|
| 24 |
-
| FND 08 | Freeze JSON contract (shared A+B) | FND 04 | Empty Pydantic models | 0.75h |
|
| 25 |
| MOD 09 | Build output parser for ScientistAction | MOD 01 | ScientistAction schema | 0.75h |
|
| 26 |
| AGT 01 | Draft Scientist system prompt | MOD 01, SCN 11 | ScientistAction schema + generate_scenario | 0.75h |
|
| 27 |
| AGT 05 | Implement feasibility checker (shared A+B) | SCN 07, MOD 05 | Constraint generator + validation | 1.25h |
|
| 28 |
| SCN 11 | Create golden scenarios for prompt testing | SCN 09 | generate_scenario() | 0.75h |
|
| 29 |
| JDG 10 | Expose component metrics for training plots | JDG 05, JDG 07 | Reward breakdown (A) + logging (C) | 0.5h |
|
| 30 |
|
| 31 |
-
**Total: 6 tasks, 4.75h**
|
| 32 |
|
| 33 |
### What to ask Person A for first (priority order)
|
| 34 |
|
| 35 |
-
1. **
|
| 36 |
-
2. **MOD
|
| 37 |
-
3. **
|
| 38 |
-
4. **SCN
|
| 39 |
-
5. **
|
| 40 |
-
6. **
|
| 41 |
-
7. **SCN 08** (minimum viable replication spec) -- unblocks AGT 06 after AGT 05
|
| 42 |
|
| 43 |
---
|
| 44 |
|
|
@@ -118,9 +116,9 @@ are done.
|
|
| 118 |
|
| 119 |
All phases are gated by the listed external dependency being delivered first.
|
| 120 |
|
| 121 |
-
### Phase 1:
|
| 122 |
|
| 123 |
-
1. **FND 08** -- Freeze JSON contract (shared with Person A)
|
| 124 |
|
| 125 |
### Phase 2: After Person A and B complete FND 08, and Person A delivers MOD 01 + SCN 09
|
| 126 |
|
|
@@ -174,7 +172,8 @@ All phases are gated by the listed external dependency being delivered first.
|
|
| 174 |
|
| 175 |
| Category | Count | Hours |
|
| 176 |
|----------|-------|-------|
|
| 177 |
-
|
|
|
|
|
| 178 |
| Blocked by Person A then Person B chain | 8 | 6.25h |
|
| 179 |
| Blocked by Person C | 3 | 2.5h |
|
| 180 |
| Deep training chain (internal) | 11 | 7.5h |
|
|
@@ -183,19 +182,59 @@ All phases are gated by the listed external dependency being delivered first.
|
|
| 183 |
|
| 184 |
---
|
| 185 |
|
| 186 |
-
## 9. Key Risks for Person B
|
| 187 |
|
| 188 |
| Risk | Impact | Mitigation |
|
| 189 |
|------|--------|------------|
|
| 190 |
| Person A MOD 01-03 delayed | Blocks AGT 01, MOD 09, AGT 02-04 and all downstream | Communicate priority order to Person A early |
|
| 191 |
| Person C API delayed | Blocks entire training pipeline (TRN 01-15) | Coordinate with Person C on API 06 timeline |
|
| 192 |
-
|
|
| 193 |
| RL training produces flat rewards | No improvement to demo | Have baseline heuristic ready, tune reward weights with Person A |
|
| 194 |
| Scientist produces invalid JSON | Rollout loop crashes | AGT 03 parse plus retry is critical, build it robustly |
|
| 195 |
|
| 196 |
---
|
| 197 |
|
| 198 |
-
## 10. Files Person B Owns
|
| 199 |
|
| 200 |
| File | Purpose |
|
| 201 |
|------|---------|
| 9 |
|
| 10 |
## 1. Blocking Status
|
| 11 |
|
| 12 |
+
Per the source of truth, Person B now has one unblocked task.
|
| 13 |
+
The immediate next task is `FND 08` because `FND 04` is complete in `replicalab/models.py`.
|
| 14 |
|
| 15 |
---
|
| 16 |
|
| 21 |
|
| 22 |
| ID | Task | Depends On | Person A Deliverable | Est |
|
| 23 |
|----|------|-----------|---------------------|-----|
|
| 24 |
| MOD 09 | Build output parser for ScientistAction | MOD 01 | ScientistAction schema | 0.75h |
|
| 25 |
| AGT 01 | Draft Scientist system prompt | MOD 01, SCN 11 | ScientistAction schema + generate_scenario | 0.75h |
|
| 26 |
| AGT 05 | Implement feasibility checker (shared A+B) | SCN 07, MOD 05 | Constraint generator + validation | 1.25h |
|
| 27 |
| SCN 11 | Create golden scenarios for prompt testing | SCN 09 | generate_scenario() | 0.75h |
|
| 28 |
| JDG 10 | Expose component metrics for training plots | JDG 05, JDG 07 | Reward breakdown (A) + logging (C) | 0.5h |
|
| 29 |
|
| 30 |
+
**Total: 5 tasks, 4.0h**
|
| 31 |
|
| 32 |
### What to ask Person A for first (priority order)
|
| 33 |
|
| 34 |
+
1. **MOD 01** (ScientistAction schema) -- unblocks MOD 09 and, after SCN 11, AGT 01
|
| 35 |
+
2. **MOD 03** (Observation models) -- unblocks AGT 02
|
| 36 |
+
3. **SCN 09** (generate_scenario) -- unblocks SCN 11 golden scenarios
|
| 37 |
+
4. **SCN 07 + MOD 05** (constraints + validation) -- unblocks AGT 05, AGT 06, AGT 07
|
| 38 |
+
5. **JDG 05 + JDG 06** (reward breakdown + explanation) -- unblocks AGT 10 and covers only part of the path for JDG 10, which also needs JDG 07 from Person C
|
| 39 |
+
6. **SCN 08** (minimum viable replication spec) -- unblocks AGT 06 after AGT 05
|
| 40 |
|
| 41 |
---
|
| 42 |
|
| 116 |
|
| 117 |
All phases are gated by the listed external dependency being delivered first.
|
| 118 |
|
| 119 |
+
### Phase 1: Available now
|
| 120 |
|
| 121 |
+
1. **FND 08** -- Freeze JSON contract (shared with Person A; unblocked because `FND 04` is complete)
|
| 122 |
|
| 123 |
### Phase 2: After Person A and B complete FND 08, and Person A delivers MOD 01 + SCN 09
|
| 124 |
|
|
| 172 |
|
| 173 |
| Category | Count | Hours |
|
| 174 |
|----------|-------|-------|
|
| 175 |
+
| Currently unblocked | 1 | 0.75h |
|
| 176 |
+
| Blocked by Person A (first-order) | 5 | 4.0h |
|
| 177 |
| Blocked by Person A then Person B chain | 8 | 6.25h |
|
| 178 |
| Blocked by Person C | 3 | 2.5h |
|
| 179 |
| Deep training chain (internal) | 11 | 7.5h |
|
| 182 |
|
| 183 |
---
|
| 184 |
|
| 185 |
+
## 9. Base Model Assumptions
|
| 186 |
+
|
| 187 |
+
### Trainable Scientist policy
|
| 188 |
+
|
| 189 |
+
Primary model: **Qwen3-4B**
|
| 190 |
+
|
| 191 |
+
| Constraint | Qwen3-4B | Qwen3-8B (stretch) |
|
| 192 |
+
|-----------|----------|-------------------|
|
| 193 |
+
| H100 training (BF16, ~3-4x inference mem) | ~14GB weights, ~42-56GB total. Fits 80GB easily | ~19GB weights, ~57-76GB total. Tight |
|
| 194 |
+
| Colab T4 (16GB, 4-bit QLoRA) | 5.5GB. Fits comfortably | 6.5GB. Fits but less headroom |
|
| 195 |
+
| Structured JSON output | Good | Better |
|
| 196 |
+
| RL iteration speed | Fast | Slower |
|
| 197 |
+
|
| 198 |
+
Qwen3-8B is H100-only stretch. Use only if Qwen3-4B quality is insufficient and
|
| 199 |
+
Colab demo uses a reduced-scale fallback.
|
| 200 |
+
|
| 201 |
+
### Reward
|
| 202 |
+
|
| 203 |
+
The training reward is always the **deterministic rubric engine** (E05 in the
|
| 204 |
+
source of truth). A hosted frontier evaluator may optionally be used for
|
| 205 |
+
post-episode explanation and demo audit. The frontier evaluator is never part
|
| 206 |
+
of the training reward loop.
|
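As a sketch of what deterministic means here: same component scores in, same scalar out, with no model call anywhere on the path. The dimension names come from this document; the weights are placeholder assumptions, not the E05 rubric spec.

```python
# Hedged sketch of a deterministic rubric reward. Weights are illustrative
# assumptions; the real E05 rubric engine defines its own components.
RUBRIC_WEIGHTS = {"rigor": 0.4, "feasibility": 0.3, "fidelity": 0.3}

def rubric_reward(component_scores: dict) -> float:
    """Weighted sum of per-dimension scores in [0, 1]; fully reproducible."""
    return round(sum(w * component_scores[name]
                     for name, w in RUBRIC_WEIGHTS.items()), 6)
```

Because the function is a pure weighted sum, rerunning it on logged scores reproduces the training reward exactly, which is what keeps replays auditable.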
| 207 |
+
|
| 208 |
+
### Future model-backed Lab Manager
|
| 209 |
+
|
| 210 |
+
If the Lab Manager later becomes model-backed:
|
| 211 |
+
- The reward formula does not change. The deterministic rubric scores the final
|
| 212 |
+
protocol against ground truth constraints regardless of how the Lab Manager
|
| 213 |
+
generates its responses.
|
| 214 |
+
- Episode variance increases because the same seed may produce different
|
| 215 |
+
negotiation paths, but the scoring dimensions (rigor, feasibility, fidelity)
|
| 216 |
+
remain deterministic.
|
| 217 |
+
- The pragmatic default is same base model (Qwen3-4B) with a separate
|
| 218 |
+
role-specific adapter. One base model in memory, swap adapters per turn.
|
| 219 |
+
- Reward does not split into separate Scientist vs Lab Manager objectives.
|
| 220 |
+
Both roles share the same cooperative reward signal.
|
| 221 |
+
|
| 222 |
+
---
|
| 223 |
+
|
| 224 |
+
## 10. Key Risks for Person B
|
| 225 |
|
| 226 |
| Risk | Impact | Mitigation |
|
| 227 |
|------|--------|------------|
|
| 228 |
| Person A MOD 01-03 delayed | Blocks AGT 01, MOD 09, AGT 02-04 and all downstream | Communicate priority order to Person A early |
|
| 229 |
| Person C API delayed | Blocks entire training pipeline (TRN 01-15) | Coordinate with Person C on API 06 timeline |
|
| 230 |
+
| Qwen3-4B underperforms on structured output | Scientist produces low quality protocols | Fall back to Qwen3-8B on H100, use reduced-scale Colab fallback |
|
| 231 |
| RL training produces flat rewards | No improvement to demo | Have baseline heuristic ready, tune reward weights with Person A |
|
| 232 |
| Scientist produces invalid JSON | Rollout loop crashes | AGT 03 parse plus retry is critical, build it robustly |
|
| 233 |
+
| Future model-backed Lab Manager increases variance | Slower RL convergence | Keep rule-based for MVP training, introduce model-backed only after Scientist policy is stable |
|
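The AGT 03 "parse plus retry" mitigation in the risk table above can be sketched as follows. This is an illustrative outline, not the project's implementation; the regex extraction and the `noop` fallback action name are assumptions.

```python
import json
import re

def parse_action(text: str):
    """Best-effort extraction of one JSON object from raw model output."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

def safe_rollout_action(generate, max_retries: int = 2) -> dict:
    """Retry generation on unparseable output, then fall back instead of crashing."""
    for _ in range(max_retries + 1):
        parsed = parse_action(generate())
        if parsed is not None:
            return parsed
    return {"action": "noop"}  # hypothetical safe fallback action
```

The point is that invalid JSON costs reward (via a wasted turn or a noop) rather than crashing the rollout loop.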
| 234 |
|
| 235 |
---
|
| 236 |
|
| 237 |
+
## 11. Files Person B Owns
|
| 238 |
|
| 239 |
| File | Purpose |
|
| 240 |
|------|---------|
|
docs/ayush/task_list.md
CHANGED
|
@@ -4,6 +4,13 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 4 |
|
| 5 |
---
|
| 6 |
|
| 7 |
## Epic E02. Domain Models
|
| 8 |
|
| 9 |
- [ ] **MOD 09** | Add output parser that maps model text to `ScientistAction` | 0.75h | Depends: MOD 01
|
|
@@ -50,7 +57,7 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 50 |
- [ ] **TRN 09** | Add policy loading path for trained adapter | 0.5h | Depends: TRN 05
|
| 51 |
- [ ] **TRN 10** | Export plot image and sample logs to outputs/plots | 0.25h | Depends: TRN 07
|
| 52 |
- [ ] **TRN 13** | Create reusable environment client module (client.py) | 1h | Depends: API 06
|
| 53 |
-
- [ ] **TRN 14** | Select and document base model (notebook side) | 0.5h | Depends: TRN 01
|
| 54 |
- [ ] **TRN 15** | Add agreement rate and invalid action rate aggregation | 0.5h | Depends: TRN 06, TRN 08, OBS 09
|
| 55 |
|
| 56 |
---
|
|
@@ -69,7 +76,7 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 69 |
|
| 70 |
## Shared Tasks
|
| 71 |
|
| 72 |
-
- [ ] **FND 08** | Freeze JSON contract for actions and observations (with Person A) | 0.75h | Depends: FND 04
|
| 73 |
|
| 74 |
---
|
| 75 |
| 4 |
|
| 5 |
---
|
| 6 |
|
| 7 |
+
## Current status
|
| 8 |
+
|
| 9 |
+
- `FND 04` is complete in `replicalab/models.py`
|
| 10 |
+
- `FND 08` is now the next unblocked Ayush task
|
| 11 |
+
|
| 12 |
+
---
|
| 13 |
+
|
| 14 |
## Epic E02. Domain Models
|
| 15 |
|
| 16 |
- [ ] **MOD 09** | Add output parser that maps model text to `ScientistAction` | 0.75h | Depends: MOD 01
|
| 57 |
- [ ] **TRN 09** | Add policy loading path for trained adapter | 0.5h | Depends: TRN 05
|
| 58 |
- [ ] **TRN 10** | Export plot image and sample logs to outputs/plots | 0.25h | Depends: TRN 07
|
| 59 |
- [ ] **TRN 13** | Create reusable environment client module (client.py) | 1h | Depends: API 06
|
| 60 |
+
- [ ] **TRN 14** | Select and document base model (notebook side) | 0.5h | Depends: TRN 01 | Assumption: Qwen3-4B primary, Qwen3-8B H100-only stretch
|
| 61 |
- [ ] **TRN 15** | Add agreement rate and invalid action rate aggregation | 0.5h | Depends: TRN 06, TRN 08, OBS 09
|
| 62 |
|
| 63 |
---
|
| 76 |
|
| 77 |
## Shared Tasks
|
| 78 |
|
| 79 |
+
- [ ] **FND 08** | Freeze JSON contract for actions and observations (with Person A) | 0.75h | Depends: FND 04 (done) | Status: ready now
|
| 80 |
|
| 81 |
---
|
| 82 |
|