Person B (Ayush) Task Breakdown and Execution Plan
Source of truth: ReplicaLab_Comprehensive_Task_Division.md
1. Status
Ayush's implementation lane is complete.
Completed tasks in this lane now cover:
- Scientist prompting and parsing
- Baseline Scientist policy
- Shared deterministic Lab Manager grounding contributions
- Notebook and reusable training stack
- ART/OpenEnv rollout-to-trainer integration
- Metrics, plotting, evaluation, trained-policy loading, and metadata export
- Fresh-runtime notebook smoke validation
The remaining training risk is no longer unfinished backlog work in Ayush's lane; it is model quality:
- The ART/OpenEnv Scientist runtime is live and reproducible.
- The latest live checkpoint still underperforms the deterministic baseline on held-out comparison.
- The next useful work is experiment iteration, not infrastructure completion.
2. Final Verification State
The following validation steps are now complete:
- `scientist-preview` smoke run
- `lab-manager-preview` smoke run (live)
- `art-scientist-train` smoke run against the hosted ReplicaLab environment
- `scientist-compare-eval` smoke run against the trained checkpoint
- focused training-policy tests and CLI tests
Smoke artifacts now exist under:
- `replicalab/outputs/training/scientist-preview-smoke-20260308/`
- `replicalab/outputs/training/lab-manager-preview-smoke-20260308/`
- `replicalab/outputs/art-training/art-scientist-smoke-20260308/`
- `replicalab/outputs/art-training/art-scientist-compare-smoke-20260308/`
3. Remaining External Work
No Ayush-owned backlog items remain.
Open work outside this lane that still matters to the final story:
- TRN 12 (owned by Person D): turn evaluation outputs into judge-facing result bullets
- UI and README result presentation tasks
- demo-storytelling tasks
These are not blockers for the training runtime itself.
4. Next Technical Focus
If work continues in this lane, it should target model improvement rather than missing task closure:
- Increase Scientist training coverage beyond the current smoke scenario set
- Inspect failure episodes from `art-scientist-compare-20260308-step5` and `art-scientist-compare-smoke-20260308`
- Add stronger warm-start or curriculum training before further RL updates
- Execute the Lab Manager SFT path live and evaluate its effect separately
- Keep baseline-vs-trained comparisons on fixed seeds and frozen evidence packs
- Track `paper_understanding` and `communication_quality` on every eval run
- Keep the shared benchmark-history plots updating across runs
- Use `docs/training_goals.md` as the near-term model-goals reference
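The fixed-seed, frozen-evidence-pack comparison discipline above can be sketched as a small harness. This is a hypothetical illustration only: `run_episode`, the policy names, the evidence-pack names, and the seed list are all invented for the example and do not reflect the repository's real rollout API; the stand-in episode function just produces deterministic stub scores for the two tracked metrics.

```python
import random
import statistics

# Frozen evaluation inputs: the same seeds and evidence packs are reused for
# every checkpoint so that baseline-vs-trained deltas are comparable.
EVAL_SEEDS = [101, 202, 303]                    # hypothetical frozen seed list
FROZEN_EVIDENCE_PACKS = ["pack_a", "pack_b"]    # hypothetical evidence packs

def run_episode(policy: str, pack: str, seed: int) -> dict:
    """Stand-in for a real rollout; returns stub rubric scores per episode.

    Seeding Random with a string is deterministic across runs, so repeated
    evaluations of the same (policy, pack, seed) triple agree exactly.
    """
    rng = random.Random(f"{policy}|{pack}|{seed}")
    return {
        "paper_understanding": rng.uniform(0.4, 0.9),
        "communication_quality": rng.uniform(0.4, 0.9),
    }

def evaluate(policy: str) -> dict:
    """Average each tracked metric over the full frozen seed/pack grid."""
    scores = [
        run_episode(policy, pack, seed)
        for pack in FROZEN_EVIDENCE_PACKS
        for seed in EVAL_SEEDS
    ]
    return {
        metric: statistics.mean(s[metric] for s in scores)
        for metric in ("paper_understanding", "communication_quality")
    }

baseline = evaluate("deterministic-baseline")
trained = evaluate("trained-checkpoint")
for metric in baseline:
    delta = trained[metric] - baseline[metric]
    print(f"{metric}: baseline={baseline[metric]:.3f} "
          f"trained={trained[metric]:.3f} delta={delta:+.3f}")
```

Because every checkpoint is scored on the identical seed/pack grid, any metric delta reflects the policy change rather than evaluation noise.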
5. Base Model Assumptions
Primary shared base: Qwen3.5-9B
- Scientist uses the shared base with a GRPO-style trainable adapter.
- Lab Manager uses the same shared base with a separate SFT adapter.
- `Qwen3.5-4B` remains the lower-memory fallback.
- `Qwen3.5-122B-A10B` is an audit-only judge candidate, not the reward source.
- The deterministic rubric remains the only training reward source.
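The base-model assumptions above can be summarized as a single plan structure with a consistency check. This is an illustrative sketch, not the repository's actual config schema: the `MODEL_PLAN` keys and values are invented for the example, and only the model names and roles come from the assumptions listed above.

```python
# Hypothetical summary of the base-model plan; the dict schema is invented
# for illustration and is not the repository's real training config.
MODEL_PLAN = {
    "shared_base": "Qwen3.5-9B",
    "fallback_base": "Qwen3.5-4B",           # lower-memory fallback
    "judge_candidate": "Qwen3.5-122B-A10B",  # audit-only; never the reward source
    "reward_source": "deterministic_rubric",
    "adapters": {
        "scientist": {"method": "GRPO", "base": "Qwen3.5-9B"},
        "lab_manager": {"method": "SFT", "base": "Qwen3.5-9B"},
    },
}

# Sanity checks: both roles must train adapters on the same shared base,
# and the reward must come from the deterministic rubric, not a judge model.
assert all(
    adapter["base"] == MODEL_PLAN["shared_base"]
    for adapter in MODEL_PLAN["adapters"].values()
)
assert MODEL_PLAN["reward_source"] == "deterministic_rubric"
print("model plan consistent")
```

Encoding the plan this way makes the two invariants (one shared base, rubric-only reward) mechanically checkable before a training run starts.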
6. Summary Table
| Category | Count | Status |
|---|---|---|
| Ayush-owned tasks remaining | 0 | Closed |
| Technical blockers in Ayush lane | 0 | Closed |
| Live runtime path | 1 | Validated |
| Main remaining risk | 1 | Model quality, not infrastructure |