# Person B (Ayush) Task Breakdown and Execution Plan
Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
---
## 1. Status
Ayush's implementation lane is complete.
Completed tasks in this lane now cover:
1. Scientist prompting and parsing
2. Baseline Scientist policy
3. Shared deterministic Lab Manager grounding contributions
4. Notebook and reusable training stack
5. ART/OpenEnv rollout-to-trainer integration
6. Metrics, plotting, evaluation, trained-policy loading, and metadata export
7. Fresh-runtime notebook smoke validation
The remaining training risk is no longer unfinished backlog work in Ayush's lane; it is model quality:
1. The ART/OpenEnv Scientist runtime is live and reproducible.
2. The latest live checkpoint still underperforms the deterministic baseline on the held-out comparison.
3. The next useful work is experiment iteration, not infrastructure completion.
---
## 2. Final Verification State
The following validation steps are now complete:
1. `scientist-preview` smoke run
2. `lab-manager-preview` smoke run
3. live `art-scientist-train` smoke run against the hosted ReplicaLab environment
4. `scientist-compare-eval` smoke run against the trained checkpoint
5. focused training-policy tests and CLI tests
Smoke artifacts now exist under:
1. `replicalab/outputs/training/scientist-preview-smoke-20260308/`
2. `replicalab/outputs/training/lab-manager-preview-smoke-20260308/`
3. `replicalab/outputs/art-training/art-scientist-smoke-20260308/`
4. `replicalab/outputs/art-training/art-scientist-compare-smoke-20260308/`
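As a minimal sketch, the presence of these smoke artifacts can be checked programmatically. The directory names come from the list above; the helper function itself is hypothetical and not part of the training stack:

```python
from pathlib import Path

# Smoke-artifact directories listed above, relative to the repo root.
EXPECTED_SMOKE_DIRS = [
    "replicalab/outputs/training/scientist-preview-smoke-20260308",
    "replicalab/outputs/training/lab-manager-preview-smoke-20260308",
    "replicalab/outputs/art-training/art-scientist-smoke-20260308",
    "replicalab/outputs/art-training/art-scientist-compare-smoke-20260308",
]


def missing_smoke_artifacts(repo_root: str) -> list[str]:
    """Return the expected smoke-artifact directories that are absent."""
    root = Path(repo_root)
    return [d for d in EXPECTED_SMOKE_DIRS if not (root / d).is_dir()]
```

A check like this could run in CI so a fresh clone fails fast if a validation artifact was never committed or regenerated.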
---
## 3. Remaining External Work
No Ayush-owned backlog items remain.
Open work outside this lane that still matters to the final story:
1. `TRN 12` owned by Person D: turn evaluation outputs into judge-facing result bullets
2. UI and README result presentation tasks
3. demo-storytelling tasks
These are not blockers for the training runtime itself.
---
## 4. Next Technical Focus
If work continues in this lane, it should target model improvement rather than missing task closure:
1. Increase Scientist training coverage beyond the current smoke scenario set
2. Inspect failure episodes from `art-scientist-compare-20260308-step5` and `art-scientist-compare-smoke-20260308`
3. Add stronger warm-start or curriculum before more RL updates
4. Execute the Lab Manager SFT path live and evaluate its effect separately
5. Keep baseline-vs-trained comparisons on fixed seeds and frozen evidence packs
6. Track `paper_understanding` and `communication_quality` on every eval run
7. Keep the shared benchmark-history plots updating across runs
8. Use `docs/training_goals.md` as the near-term model-goals reference
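Items 5 and 6 above can be sketched as a small comparison harness: both policies are rolled out on the same frozen seed set, and the two tracked rubric metrics are averaged per policy. Everything here is illustrative (the seed values, `run_episode`, and `compare` are hypothetical stand-ins); a real harness would score rollouts with the deterministic rubric against frozen evidence packs:

```python
import random
from statistics import mean

# Illustrative frozen seed set; the real eval would pin its own.
EVAL_SEEDS = [101, 202, 303, 404]
# The two rubric metrics this document says every eval run must track.
TRACKED_METRICS = ("paper_understanding", "communication_quality")


def run_episode(policy_name: str, seed: int) -> dict[str, float]:
    """Placeholder rollout: deterministic in (policy, seed) so reruns match."""
    rng = random.Random(f"{policy_name}:{seed}")  # str seeding is deterministic
    return {m: rng.uniform(0.0, 1.0) for m in TRACKED_METRICS}


def compare(baseline: str, trained: str) -> dict[str, dict[str, float]]:
    """Mean per-metric score for each policy over the same frozen seeds."""
    report: dict[str, dict[str, float]] = {}
    for policy in (baseline, trained):
        episodes = [run_episode(policy, s) for s in EVAL_SEEDS]
        report[policy] = {m: mean(e[m] for e in episodes) for m in TRACKED_METRICS}
    return report
```

Keeping the seed set and evidence packs fixed is what makes a baseline-vs-trained delta attributable to the checkpoint rather than to environment noise.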
---
## 5. Base Model Assumptions
Primary shared base: **Qwen3.5-9B**
1. Scientist uses the shared base with a GRPO-style trainable adapter.
2. Lab Manager uses the same shared base with a separate SFT adapter.
3. `Qwen3.5-4B` remains the lower-memory fallback.
4. `Qwen3.5-122B-A10B` is an audit-only judge candidate, not the reward source.
5. The deterministic rubric remains the only training reward source.
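The role-to-adapter assumptions above can be captured as a small config sketch. The field and role names are illustrative only (the actual training stack defines its own configuration); the model names and the rubric-only reward constraint come from this document:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleConfig:
    base_model: str    # shared base checkpoint
    adapter: str       # training method for the role-specific adapter
    reward_source: str # only the deterministic rubric feeds training


SHARED_BASE = "Qwen3.5-9B"  # "Qwen3.5-4B" remains the lower-memory fallback

ROLES = {
    "scientist": RoleConfig(SHARED_BASE, adapter="grpo",
                            reward_source="deterministic_rubric"),
    "lab_manager": RoleConfig(SHARED_BASE, adapter="sft",
                              reward_source="deterministic_rubric"),
}
# "Qwen3.5-122B-A10B" is an audit-only judge candidate, so it is
# deliberately absent from ROLES and never a reward source.
```

Pinning both roles to one `SHARED_BASE` constant makes the shared-base assumption explicit and prevents the two adapters from silently drifting onto different checkpoints.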
---
## 6. Summary Table
| Category | Count | Status |
|----------|-------|--------|
| Ayush-owned tasks remaining | 0 | Closed |
| Technical blockers in Ayush lane | 0 | Closed |
| Live runtime path | 1 | Validated |
| Main remaining risk | 1 | Model quality, not infrastructure |