Person B (Ayush) Task Breakdown and Execution Plan

Source of truth: ReplicaLab_Comprehensive_Task_Division.md


1. Status

Ayush's implementation lane is complete.

Completed tasks in this lane now cover:

  1. Scientist prompting and parsing
  2. Baseline Scientist policy
  3. Shared deterministic Lab Manager grounding contributions
  4. Notebook and reusable training stack
  5. ART/OpenEnv rollout-to-trainer integration
  6. Metrics, plotting, evaluation, trained-policy loading, and metadata export
  7. Fresh-runtime notebook smoke validation

The remaining risk in this lane is model quality, not missing backlog work:

  1. The ART/OpenEnv Scientist runtime is live and reproducible.
  2. The latest live checkpoint still underperforms the deterministic baseline on the held-out comparison.
  3. The next useful work is experiment iteration, not infrastructure completion.

2. Final Verification State

The following validation steps are now complete:

  1. scientist-preview smoke run
  2. lab-manager-preview smoke run
  3. live art-scientist-train smoke run against the hosted ReplicaLab environment
  4. scientist-compare-eval smoke run against the trained checkpoint
  5. focused training-policy tests and CLI tests

Smoke artifacts now exist under:

  1. replicalab/outputs/training/scientist-preview-smoke-20260308/
  2. replicalab/outputs/training/lab-manager-preview-smoke-20260308/
  3. replicalab/outputs/art-training/art-scientist-smoke-20260308/
  4. replicalab/outputs/art-training/art-scientist-compare-smoke-20260308/
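
A quick way to confirm these directories on a fresh checkout is sketched below. It assumes the paths are relative to the repository root and that an empty directory counts as missing. The helper name `missing_artifact_dirs` is hypothetical and is not part of the ReplicaLab codebase.

```python
from pathlib import Path

# Directory names copied from the list above.
# Assumption: they are relative to the repository root.
SMOKE_ARTIFACT_DIRS = [
    "replicalab/outputs/training/scientist-preview-smoke-20260308",
    "replicalab/outputs/training/lab-manager-preview-smoke-20260308",
    "replicalab/outputs/art-training/art-scientist-smoke-20260308",
    "replicalab/outputs/art-training/art-scientist-compare-smoke-20260308",
]


def missing_artifact_dirs(repo_root, relative_dirs=SMOKE_ARTIFACT_DIRS):
    """Return the relative paths that are absent or empty under repo_root.

    Hypothetical helper for illustration only.
    """
    root = Path(repo_root)
    missing = []
    for rel in relative_dirs:
        path = root / rel
        # Treat a directory with no entries as missing, since a smoke run
        # that wrote nothing is not useful evidence.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing
```

Calling `missing_artifact_dirs(".")` from the repository root should return an empty list when all four smoke runs have written output.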

3. Remaining External Work

No Ayush-owned backlog items remain.

Open work outside this lane that still matters to the final story:

  1. TRN 12 owned by Person D: turn evaluation outputs into judge-facing result bullets
  2. UI and README result presentation tasks
  3. demo-storytelling tasks

These are not blockers for the training runtime itself.


4. Next Technical Focus

If work continues in this lane, it should target model improvement rather than task closure:

  1. Increase Scientist training coverage beyond the current smoke scenario set
  2. Inspect failure episodes from art-scientist-compare-20260308-step5 and art-scientist-compare-smoke-20260308
  3. Add stronger warm-start or curriculum before more RL updates
  4. Execute the Lab Manager SFT path live and evaluate its effect separately
  5. Keep baseline-vs-trained comparisons on fixed seeds and frozen evidence packs
  6. Track paper_understanding and communication_quality on every eval run
  7. Keep the shared benchmark-history plots updating across runs
  8. Use docs/training_goals.md as the near-term model-goals reference
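
Items 5 and 6 above can be sketched as a paired comparison: both policies run on the same seeds, and every result must carry the tracked metrics. This is a minimal illustration, not the repository's evaluation code. The runner callables, the `reward` key, and the function name `compare_on_fixed_seeds` are assumptions. Only `paper_understanding` and `communication_quality` come from this document.

```python
from statistics import mean

# "reward" is an assumed key name; the other two metrics are named in this plan.
REQUIRED_KEYS = ("reward", "paper_understanding", "communication_quality")


def compare_on_fixed_seeds(run_baseline, run_trained, seeds):
    """Run both policies on identical seeds and report per-seed reward deltas.

    run_baseline and run_trained are hypothetical callables that take a seed
    and return a dict of metric name to float for one episode.
    """
    seeds = list(seeds)
    if not seeds:
        raise ValueError("at least one seed is required")

    rows = []
    for seed in seeds:
        results = {"baseline": run_baseline(seed), "trained": run_trained(seed)}
        # Refuse to compare runs that silently dropped a tracked metric.
        for name, result in results.items():
            missing = [key for key in REQUIRED_KEYS if key not in result]
            if missing:
                raise ValueError(f"{name} run for seed {seed} is missing {missing}")
        delta = results["trained"]["reward"] - results["baseline"]["reward"]
        rows.append({"seed": seed, "delta": delta, **results})

    return {
        "per_seed": rows,
        "mean_delta": mean(row["delta"] for row in rows),
        "trained_wins": sum(row["delta"] > 0 for row in rows),
    }
```

Pairing by seed keeps scenario variance out of the comparison, so a negative `mean_delta` points at the policy rather than at an unlucky scenario draw.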

5. Base Model Assumptions

Primary shared base: Qwen3.5-9B

  1. Scientist uses the shared base with a GRPO-style trainable adapter.
  2. Lab Manager uses the same shared base with a separate SFT adapter.
  3. Qwen3.5-4B remains the lower-memory fallback.
  4. Qwen3.5-122B-A10B is an audit-only judge candidate, not the reward source.
  5. The deterministic rubric remains the only training reward source.
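
These assumptions can be written down as a small configuration object so that a run fails fast if it drifts from them. The sketch below is illustrative only. The class names, field names, and the `deterministic_rubric` label are hypothetical. The model names are copied from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterPlan:
    role: str        # "scientist" or "lab_manager"
    base_model: str  # shared base checkpoint name
    method: str      # "grpo-style" or "sft"


@dataclass(frozen=True)
class ModelPlan:
    # Model names come from this plan; the field names are hypothetical.
    primary_base: str = "Qwen3.5-9B"
    fallback_base: str = "Qwen3.5-4B"
    audit_judge: str = "Qwen3.5-122B-A10B"
    reward_source: str = "deterministic_rubric"

    def adapters(self, low_memory=False):
        """Both roles share one base; only the adapter and method differ."""
        base = self.fallback_base if low_memory else self.primary_base
        return [
            AdapterPlan("scientist", base, "grpo-style"),
            AdapterPlan("lab_manager", base, "sft"),
        ]

    def validate(self):
        """Reject any plan where the reward is not the deterministic rubric."""
        if self.reward_source != "deterministic_rubric":
            raise ValueError(
                "the deterministic rubric must remain the only training reward source"
            )
```

Keeping the judge model as a separate field, unconnected to `reward_source`, mirrors the rule that the judge candidate is audit-only.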

6. Summary Table

| Category | Count | Status |
|---|---|---|
| Ayush-owned tasks remaining | 0 | Closed |
| Technical blockers in Ayush lane | 0 | Closed |
| Live runtime path | 1 | Validated |
| Main remaining risk | 1 | Model quality, not infrastructure |