# Person B (Ayush) Task Breakdown and Execution Plan

Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`

---

## 1. Status

Ayush's implementation lane is complete.

Completed tasks in this lane now cover:

1. Scientist prompting and parsing
2. Baseline Scientist policy
3. Shared deterministic Lab Manager grounding contributions
4. Notebook and reusable training stack
5. ART/OpenEnv rollout-to-trainer integration
6. Metrics, plotting, evaluation, trained-policy loading, and metadata export
7. Fresh-runtime notebook smoke validation

The remaining training risk is no longer missing backlog work in Ayush's lane. It is model quality:

1. The ART/OpenEnv Scientist runtime is live and reproducible.
2. The latest live checkpoint still underperforms the deterministic baseline on held-out comparison.
3. The next useful work is experiment iteration, not infrastructure completion.

---

## 2. Final Verification State

The following validation steps are now complete:

1. `scientist-preview` smoke run
2. `lab-manager-preview` smoke run
3. live `art-scientist-train` smoke run against the hosted ReplicaLab environment
4. `scientist-compare-eval` smoke run against the trained checkpoint
5. focused training-policy tests and CLI tests

Smoke artifacts now exist under:

1. `replicalab/outputs/training/scientist-preview-smoke-20260308/`
2. `replicalab/outputs/training/lab-manager-preview-smoke-20260308/`
3. `replicalab/outputs/art-training/art-scientist-smoke-20260308/`
4. `replicalab/outputs/art-training/art-scientist-compare-smoke-20260308/`

---

## 3. Remaining External Work

No Ayush-owned backlog items remain.

Open work outside this lane that still matters to the final story:

1. `TRN 12` owned by Person D: turn evaluation outputs into judge-facing result bullets
2. UI and README result presentation tasks
3. demo-storytelling tasks

These are not blockers for the training runtime itself.

---

## 4. Next Technical Focus

If work continues in this lane, it should target model improvement rather than missing task closure:

1. Increase Scientist training coverage beyond the current smoke scenario set
2. Inspect failure episodes from `art-scientist-compare-20260308-step5` and `art-scientist-compare-smoke-20260308`
3. Add stronger warm-start or curriculum before more RL updates
4. Execute the Lab Manager SFT path live and evaluate its effect separately
5. Keep baseline-vs-trained comparisons on fixed seeds and frozen evidence packs
6. Track `paper_understanding` and `communication_quality` on every eval run
7. Keep the shared benchmark-history plots updating across runs
8. Use `docs/training_goals.md` as the near-term model-goals reference

---

## 5. Base Model Assumptions

Primary shared base: **Qwen3.5-9B**

1. Scientist uses the shared base with a GRPO-style trainable adapter.
2. Lab Manager uses the same shared base with a separate SFT adapter.
3. `Qwen3.5-4B` remains the lower-memory fallback.
4. `Qwen3.5-122B-A10B` is an audit-only judge candidate, not the reward source.
5. The deterministic rubric remains the only training reward source.

---

## 6. Summary Table

| Category | Count | Status |
|----------|-------|--------|
| Ayush-owned tasks remaining | 0 | Closed |
| Technical blockers in Ayush lane | 0 | Closed |
| Live runtime path | 1 | Validated |
| Main remaining risk | 1 | Model quality, not infrastructure |
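
The fixed-seed comparison item in the Next Technical Focus list can be illustrated with a small paired-evaluation helper. This is a minimal sketch, not the ReplicaLab evaluation code: the `EpisodeRunner` type, `PairedComparison`, and `compare_on_fixed_seeds` are hypothetical names, and each runner is assumed to map a seed to a single scalar rubric reward.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical interface: a runner plays one episode for a given seed and
# returns the scalar reward from the deterministic rubric.
EpisodeRunner = Callable[[int], float]


@dataclass(frozen=True)
class PairedComparison:
    """Per-seed rewards for two policies evaluated on the same seeds."""

    seeds: List[int]
    baseline_rewards: List[float]
    trained_rewards: List[float]

    @property
    def deltas(self) -> List[float]:
        # Positive delta means the trained policy beat the baseline on that seed.
        return [t - b for b, t in zip(self.baseline_rewards, self.trained_rewards)]

    @property
    def mean_delta(self) -> float:
        return mean(self.deltas)

    @property
    def win_rate(self) -> float:
        # Fraction of seeds on which the trained policy strictly wins.
        return sum(d > 0 for d in self.deltas) / len(self.deltas)


def compare_on_fixed_seeds(
    baseline: EpisodeRunner,
    trained: EpisodeRunner,
    seeds: Sequence[int],
) -> PairedComparison:
    """Run both policies on an identical, explicit seed list."""
    seed_list = list(seeds)
    if not seed_list:
        raise ValueError("at least one seed is required")
    return PairedComparison(
        seeds=seed_list,
        baseline_rewards=[baseline(s) for s in seed_list],
        trained_rewards=[trained(s) for s in seed_list],
    )


if __name__ == "__main__":
    # Stub runners stand in for real rollouts; the numbers are illustrative only.
    result = compare_on_fixed_seeds(
        baseline=lambda seed: 0.5,
        trained=lambda seed: 0.25 + 0.125 * seed,
        seeds=[0, 1, 2, 3, 4],
    )
    print(result.mean_delta, result.win_rate)
```

Pairing rewards by seed keeps scenario difficulty identical across the two policies, so a per-seed delta reflects the policy rather than the draw of scenarios.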