Add model selection and architecture notes
- Qwen3-4B as primary trainable Scientist, Qwen3-8B as H100 stretch
- Deterministic rubric remains sole training reward
- Hosted frontier evaluator for optional explanation and demo audit only
- Lab Manager stays rule-based for MVP
- Future model-backed Lab Manager added to stretch backlog and risk register
- Updated Person B docs with base-model rationale and reward notes
- ReplicaLab_Comprehensive_Task_Division.md +32 -5
- docs/ayush/task_breakdown.md +56 -17
- docs/ayush/task_list.md +9 -2
ReplicaLab_Comprehensive_Task_Division.md
CHANGED
|
@@ -96,13 +96,35 @@ By judging time, the project should demonstrate:
|
|
| 96 |
| Storytelling | everyone contributes screenshots, gifs, examples |
|
| 97 |
| Submission readiness | all four review final demo, notebook, README, repo visibility |
|
| 98 |
|
| 99 |
-
## 4.1 Training compute
|
| 100 |
|
| 101 |
1. The team has access to an H100 GPU for heavier Scientist training and evaluation runs.
|
| 102 |
2. Person B is the primary owner of that compute for RL tasks, especially `TRN 04` to `TRN 10`, `TRN 13` to `TRN 15`, `OBS 06`, and `TST 09`.
|
| 103 |
3. The judged artifact remains the Colab notebook, so any H100 run must still have a documented notebook path or reduced scale fallback that can be shown in Colab.
|
| 104 |
4. Person C supports any environment URL, secret, or infra setup needed so the H100 training run can connect to the same backend contract as the notebook.
|
| 106 |
---
|
| 107 |
|
| 108 |
## 5. Module and function ownership map
|
|
@@ -209,12 +231,16 @@ Create a stable shared codebase, contracts, and development workflow so all work
|
|
| 209 |
|
| 210 |
- `FND 01` status: completed on 2026-03-07
|
| 211 |
- `FND 01` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
|
| 212 |
- `FND 10` status: completed on 2026-03-07
|
| 213 |
- `FND 10` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
|
| 214 |
- Completed scope for `FND 01`: created the agreed repo scaffold for `replicalab/`, `server/`, `frontend/`, `notebooks/`, and `tests/`, including the initial `replicalab/*` and `frontend/src/*` subfolders from the planned layout
|
| 215 |
- Completed scope for `FND 10`: created `replicalab/outputs/` with tracked `logs/`, `replays/`, and `plots/` subdirectories
|
| 216 |
-
- Remaining work now unblocked by `FND 01`: `FND 02`, `FND 03`, `FND
|
| 217 |
-
-
|
| 218 |
|
| 219 |
### User stories
|
| 220 |
|
|
@@ -231,7 +257,7 @@ As a team, we want agreed schemas and coding rules so integration risk stays low
|
|
| 231 |
| FND 01 | E01.1 | Person C | repo root | Create repo structure and base folders from agreed layout | none | 0.5h | all top level folders exist and repo clones cleanly | ✅ Completed | Person B (Ayush) |
|
| 232 |
| FND 02 | E01.1 | Person C | `pyproject.toml` | Add Python project config and dependencies placeholder | FND 01 | 0.5h | project installs locally without missing package errors for base modules | ⬜ Not started | — |
|
| 233 |
| FND 03 | E01.1 | Person C | `frontend/package.json` | Initialize React plus Vite frontend shell | FND 01 | 0.5h | `npm install` and dev server run successfully | ⬜ Not started | — |
|
| 234 |
-
| FND 04 | E01.2 | Person A | `replicalab/models.py` | Add empty Pydantic models and shared type names | FND 01 | 0.5h | import paths resolve for all placeholder models | ⬜ Not started | — |
|
| 235 |
| FND 05 | E01.2 | Person C | `.gitignore` and `.dockerignore` | Add ignore rules for Python, Node, logs, notebooks, and build artifacts. `.dockerignore` must explicitly exclude `.git`, `node_modules`, `notebooks/`, `tests/`, `__pycache__`, `.venv`, and output files to keep the Docker image lean | FND 01 | 0.25h | repo status stays clean after local run and build, and Docker build excludes non-runtime files | ⬜ Not started | — |
|
| 236 |
| FND 06 | E01.2 | Person D | `README.md` | Add temporary project stub with title, mission, team roles, and local setup placeholder | FND 01 | 0.5h | new contributor can understand repo purpose in under two minutes | ⬜ Not started | — |
|
| 237 |
| FND 07 | E01.2 | Person C | repo settings | Define branch naming, PR template, and issue template | FND 01 | 0.5h | all future PRs auto show the template and issue fields | ⬜ Not started | — |
|
|
@@ -707,7 +733,7 @@ The MVP is complete when all of the following are true:
|
|
| 707 |
| 2 | add judge plain English explanation panel | better judge readability |
|
| 708 |
| 3 | add second and third difficulty levels to all templates | stronger world modeling story |
|
| 709 |
| 4 | add curriculum training path | stronger self improvement story |
|
| 710 |
-
| 5 | add
|
| 711 |
| 6 | add third agent such as ethics reviewer | potential partner fit extension |
|
| 712 |
| 7 | add post episode self critique before retry | stronger self improvement story from Blueprint Section 14.2 |
|
| 713 |
| 8 | add automatic scenario difficulty scaling | adaptive curriculum from Blueprint Section 14.2 |
|
|
@@ -725,6 +751,7 @@ The MVP is complete when all of the following are true:
|
|
| 725 |
| reward too noisy or subjective | high | Person A | keep judge deterministic and rubric based |
|
| 726 |
| final demo breaks live | high | all | keep replay logs and a pre tested demo seed ready |
|
| 727 |
| too many scenarios | medium | Person A | ship one excellent scenario, then add more only if stable |
|
| 728 |
|
| 729 |
---
|
| 730 |
|
| 96 |
| Storytelling | everyone contributes screenshots, gifs, examples |
|
| 97 |
| Submission readiness | all four review final demo, notebook, README, repo visibility |
|
| 98 |
|
| 99 |
+
## 4.1 Training compute and model selection
|
| 100 |
|
| 101 |
1. The team has access to an H100 GPU for heavier Scientist training and evaluation runs.
|
| 102 |
2. Person B is the primary owner of that compute for RL tasks, especially `TRN 04` to `TRN 10`, `TRN 13` to `TRN 15`, `OBS 06`, and `TST 09`.
|
| 103 |
3. The judged artifact remains the Colab notebook, so any H100 run must still have a documented notebook path or reduced scale fallback that can be shown in Colab.
|
| 104 |
4. Person C supports any environment URL, secret, or infra setup needed so the H100 training run can connect to the same backend contract as the notebook.
|
| 105 |
|
| 106 |
+
### Trainable model
|
| 107 |
+
|
| 108 |
+
The primary trainable model for the Scientist policy is **Qwen3-4B**.
|
| 109 |
+
|
| 110 |
+
| Model | Role | Rationale |
|
| 111 |
+
| --- | --- | --- |
|
| 112 |
+
| Qwen3-4B | Primary Scientist policy | BF16 fits H100 (~14GB weights, ~42-56GB training). 4-bit fits Colab T4 (5.5GB). Strong structured output for JSON action schemas. Fast RL iteration speed. |
|
| 113 |
+
| Qwen3-8B | H100-only stretch | Better reasoning quality but 4-bit barely fits Colab T4 (6.5GB). Use only if Qwen3-4B quality is insufficient and Colab demo uses reduced-scale fallback. |
|
| 114 |
+
|
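The Colab-vs-H100 split above can be sketched as a small config selector. This is a hedged illustration only: the target keys, the `REPLICALAB_TARGET` environment variable, and the Hugging Face repo ids are assumptions, not frozen project choices.

```python
import os

# Sketch of the runtime split described in the table above. Target keys, the
# REPLICALAB_TARGET variable, and the repo ids are assumptions, not decisions.
MODEL_TARGETS = {
    "colab_t4": {"model": "Qwen/Qwen3-4B", "load_in_4bit": True},   # ~5.5GB
    "h100": {"model": "Qwen/Qwen3-4B", "load_in_4bit": False},      # BF16
    "h100_stretch": {"model": "Qwen/Qwen3-8B", "load_in_4bit": False},
}

def select_model_config(target=None):
    """Pick a model config for the current runtime, defaulting to the Colab path."""
    target = target or os.environ.get("REPLICALAB_TARGET", "colab_t4")
    return MODEL_TARGETS[target]
```

Keeping the Colab path as the default matches the rule that the judged artifact stays the notebook, with H100 targets as explicit opt-ins.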
| 115 |
+
### Evaluator layer
|
| 116 |
+
|
| 117 |
+
The training reward is always the **deterministic rubric engine** defined in E05. A hosted frontier evaluator may optionally be used for post-episode explanation and demo audit only. The frontier evaluator is never part of the training reward loop.
|
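The separation above can be sketched in two functions: the reward path uses rubric components only, and the frontier evaluator (if configured at all) runs after the episode and never feeds back into the reward. The equal-weight combination and function names are illustrative assumptions, not the E05 spec.

```python
def train_step_reward(component_scores: dict) -> float:
    """Deterministic reward path: rubric components only, no model call."""
    values = list(component_scores.values())
    return sum(values) / len(values)  # placeholder equal-weight combination

def maybe_explain(episode_summary: str, explain_fn=None):
    """Optional post-episode audit; explain_fn would wrap the hosted evaluator."""
    if explain_fn is None:
        return None  # training runs unchanged with no evaluator configured
    return explain_fn(episode_summary)
```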
| 118 |
+
|
| 119 |
+
### MVP role implementations
|
| 120 |
+
|
| 121 |
+
| Role | MVP implementation | Future stretch |
|
| 122 |
+
| --- | --- | --- |
|
| 123 |
+
| Scientist | Trainable policy (Qwen3-4B) | Qwen3-8B if quality insufficient |
|
| 124 |
+
| Lab Manager | Rule-based deterministic policy | Model-backed policy using same base model with separate adapter |
|
| 125 |
+
| Judge (training reward) | Deterministic rubric engine | Unchanged |
|
| 126 |
+
| Judge (explanation layer) | Optional hosted frontier evaluator | Extended explanation panel in UI |
|
| 127 |
+
|
| 128 |
---
|
| 129 |
|
| 130 |
## 5. Module and function ownership map
|
| 231 |
|
| 232 |
- `FND 01` status: completed on 2026-03-07
|
| 233 |
- `FND 01` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
|
| 234 |
+
- `FND 04` status: completed on 2026-03-08
|
| 235 |
+
- `FND 04` completed by: `Person B (Ayush)` while the assigned owner remains `Person A`
|
| 236 |
- `FND 10` status: completed on 2026-03-07
|
| 237 |
- `FND 10` completed by: `Person B (Ayush)` while the assigned owner remains `Person C`
|
| 238 |
- Completed scope for `FND 01`: created the agreed repo scaffold for `replicalab/`, `server/`, `frontend/`, `notebooks/`, and `tests/`, including the initial `replicalab/*` and `frontend/src/*` subfolders from the planned layout
|
| 239 |
+
- Completed scope for `FND 04`: added importable empty Pydantic model stubs in `replicalab/models.py` for the shared action, observation, step, state, and log contracts
|
| 240 |
- Completed scope for `FND 10`: created `replicalab/outputs/` with tracked `logs/`, `replays/`, and `plots/` subdirectories
|
| 241 |
+
- Remaining work now unblocked by `FND 01`: `FND 02`, `FND 03`, `FND 05`, `FND 06`, `FND 07`
|
| 242 |
+
- Newly unblocked by `FND 04`: `FND 08`, `FND 09`
|
| 243 |
+
- Remaining Epic E01 work still gated by follow-on dependencies: `FND 11`, `FND 12`, `FND 13`
|
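The FND 04 deliverable noted above amounts to importable empty stubs. A minimal sketch, assuming Pydantic is installed; only `ScientistAction` is named in the task table, so the other class names are illustrative placeholders until FND 08 freezes the contract.

```python
# replicalab/models.py -- empty, importable contract stubs (FND 04 scope).
# Only ScientistAction is a confirmed name; the rest are placeholders.
from pydantic import BaseModel

class ScientistAction(BaseModel):
    """Fields intentionally empty until the FND 08 contract freeze."""

class Observation(BaseModel):
    """Placeholder observation contract."""

class StepRecord(BaseModel):
    """Placeholder step contract."""

class EpisodeState(BaseModel):
    """Placeholder state contract."""

class EpisodeLog(BaseModel):
    """Placeholder log contract."""
```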
| 244 |
|
| 245 |
### User stories
|
| 246 |
|
| 257 |
| FND 01 | E01.1 | Person C | repo root | Create repo structure and base folders from agreed layout | none | 0.5h | all top level folders exist and repo clones cleanly | ✅ Completed | Person B (Ayush) |
|
| 258 |
| FND 02 | E01.1 | Person C | `pyproject.toml` | Add Python project config and dependencies placeholder | FND 01 | 0.5h | project installs locally without missing package errors for base modules | ⬜ Not started | — |
|
| 259 |
| FND 03 | E01.1 | Person C | `frontend/package.json` | Initialize React plus Vite frontend shell | FND 01 | 0.5h | `npm install` and dev server run successfully | ⬜ Not started | — |
|
| 260 |
+
| FND 04 | E01.2 | Person A | `replicalab/models.py` | Add empty Pydantic models and shared type names | FND 01 | 0.5h | import paths resolve for all placeholder models | ✅ Completed | Person B (Ayush) |
|
| 261 |
| FND 05 | E01.2 | Person C | `.gitignore` and `.dockerignore` | Add ignore rules for Python, Node, logs, notebooks, and build artifacts. `.dockerignore` must explicitly exclude `.git`, `node_modules`, `notebooks/`, `tests/`, `__pycache__`, `.venv`, and output files to keep the Docker image lean | FND 01 | 0.25h | repo status stays clean after local run and build, and Docker build excludes non-runtime files | ⬜ Not started | — |
|
| 262 |
| FND 06 | E01.2 | Person D | `README.md` | Add temporary project stub with title, mission, team roles, and local setup placeholder | FND 01 | 0.5h | new contributor can understand repo purpose in under two minutes | ⬜ Not started | — |
|
| 263 |
| FND 07 | E01.2 | Person C | repo settings | Define branch naming, PR template, and issue template | FND 01 | 0.5h | all future PRs auto show the template and issue fields | ⬜ Not started | — |
|
| 733 |
| 2 | add judge plain English explanation panel | better judge readability |
|
| 734 |
| 3 | add second and third difficulty levels to all templates | stronger world modeling story |
|
| 735 |
| 4 | add curriculum training path | stronger self improvement story |
|
| 736 |
+
| 5 | add model-backed Lab Manager using same base model with a separate role adapter | stronger multi agent depth but higher risk, reward stays deterministic, Lab Manager affects trajectory variance not reward definition |
|
| 737 |
| 6 | add third agent such as ethics reviewer | potential partner fit extension |
|
| 738 |
| 7 | add post episode self critique before retry | stronger self improvement story from Blueprint Section 14.2 |
|
| 739 |
| 8 | add automatic scenario difficulty scaling | adaptive curriculum from Blueprint Section 14.2 |
|
| 751 |
| reward too noisy or subjective | high | Person A | keep judge deterministic and rubric based |
|
| 752 |
| final demo breaks live | high | all | keep replay logs and a pre tested demo seed ready |
|
| 753 |
| too many scenarios | medium | Person A | ship one excellent scenario, then add more only if stable |
|
| 754 |
+
| future model-backed Lab Manager increases episode variance | medium | Person B | keep rule-based Lab Manager for MVP training, introduce model-backed version only after Scientist policy is stable, use same base model with separate adapter to limit infra complexity |
|
| 755 |
|
| 756 |
---
|
| 757 |
|
docs/ayush/task_breakdown.md
CHANGED
|
@@ -9,8 +9,8 @@ No assumptions from other documents are used to reclassify blocked status.
|
|
| 9 |
|
| 10 |
## 1. Blocking Status
|
| 11 |
|
| 12 |
-
Per the source of truth,
|
| 13 |
-
|
| 14 |
|
| 15 |
---
|
| 16 |
|
|
@@ -21,24 +21,22 @@ These tasks are first gated by upstream deliverables, primarily from Person A.
|
|
| 21 |
|
| 22 |
| ID | Task | Depends On | Person A Deliverable | Est |
|
| 23 |
|----|------|-----------|---------------------|-----|
|
| 24 |
-
| FND 08 | Freeze JSON contract (shared A+B) | FND 04 | Empty Pydantic models | 0.75h |
|
| 25 |
| MOD 09 | Build output parser for ScientistAction | MOD 01 | ScientistAction schema | 0.75h |
|
| 26 |
| AGT 01 | Draft Scientist system prompt | MOD 01, SCN 11 | ScientistAction schema + generate_scenario | 0.75h |
|
| 27 |
| AGT 05 | Implement feasibility checker (shared A+B) | SCN 07, MOD 05 | Constraint generator + validation | 1.25h |
|
| 28 |
| SCN 11 | Create golden scenarios for prompt testing | SCN 09 | generate_scenario() | 0.75h |
|
| 29 |
| JDG 10 | Expose component metrics for training plots | JDG 05, JDG 07 | Reward breakdown (A) + logging (C) | 0.5h |
|
| 30 |
|
| 31 |
-
**Total: 6 tasks, 4.75h**
|
| 32 |
|
| 33 |
### What to ask Person A for first (priority order)
|
| 34 |
|
| 35 |
-
1. **
|
| 36 |
-
2. **MOD
|
| 37 |
-
3. **
|
| 38 |
-
4. **SCN
|
| 39 |
-
5. **
|
| 40 |
-
6. **
|
| 41 |
-
7. **SCN 08** (minimum viable replication spec) -- unblocks AGT 06 after AGT 05
|
| 42 |
|
| 43 |
---
|
| 44 |
|
|
@@ -118,9 +116,9 @@ are done.
|
|
| 118 |
|
| 119 |
All phases are gated by the listed external dependency being delivered first.
|
| 120 |
|
| 121 |
-
### Phase 1:
|
| 122 |
|
| 123 |
-
1. **FND 08** -- Freeze JSON contract (shared with Person A)
|
| 124 |
|
| 125 |
### Phase 2: After Person A and B complete FND 08, and Person A delivers MOD 01 + SCN 09
|
| 126 |
|
|
@@ -174,7 +172,8 @@ All phases are gated by the listed external dependency being delivered first.
|
|
| 174 |
|
| 175 |
| Category | Count | Hours |
|
| 176 |
|----------|-------|-------|
|
| 177 |
-
|
|
|
|
|
| 178 |
| Blocked by Person A then Person B chain | 8 | 6.25h |
|
| 179 |
| Blocked by Person C | 3 | 2.5h |
|
| 180 |
| Deep training chain (internal) | 11 | 7.5h |
|
|
@@ -183,19 +182,59 @@ All phases are gated by the listed external dependency being delivered first.
|
|
| 183 |
|
| 184 |
---
|
| 185 |
|
| 186 |
-
## 9. Key Risks for Person B
|
| 187 |
|
| 188 |
| Risk | Impact | Mitigation |
|
| 189 |
|------|--------|------------|
|
| 190 |
| Person A MOD 01-03 delayed | Blocks AGT 01, MOD 09, AGT 02-04 and all downstream | Communicate priority order to Person A early |
|
| 191 |
| Person C API delayed | Blocks entire training pipeline (TRN 01-15) | Coordinate with Person C on API 06 timeline |
|
| 192 |
-
|
|
| 193 |
| RL training produces flat rewards | No improvement to demo | Have baseline heuristic ready, tune reward weights with Person A |
|
| 194 |
| Scientist produces invalid JSON | Rollout loop crashes | AGT 03 parse plus retry is critical, build it robustly |
|
| 195 |
|
| 196 |
---
|
| 197 |
|
| 198 |
-
## 10. Files Person B Owns
|
| 199 |
|
| 200 |
| File | Purpose |
|
| 201 |
|------|---------|
| 9 |
|
| 10 |
## 1. Blocking Status
|
| 11 |
|
| 12 |
+
Per the source of truth, Person B now has one unblocked task.
|
| 13 |
+
The immediate next task is `FND 08` because `FND 04` is complete in `replicalab/models.py`.
|
| 14 |
|
| 15 |
---
|
| 16 |
|
| 21 |
|
| 22 |
| ID | Task | Depends On | Person A Deliverable | Est |
|
| 23 |
|----|------|-----------|---------------------|-----|
|
| 24 |
| MOD 09 | Build output parser for ScientistAction | MOD 01 | ScientistAction schema | 0.75h |
|
| 25 |
| AGT 01 | Draft Scientist system prompt | MOD 01, SCN 11 | ScientistAction schema + generate_scenario | 0.75h |
|
| 26 |
| AGT 05 | Implement feasibility checker (shared A+B) | SCN 07, MOD 05 | Constraint generator + validation | 1.25h |
|
| 27 |
| SCN 11 | Create golden scenarios for prompt testing | SCN 09 | generate_scenario() | 0.75h |
|
| 28 |
| JDG 10 | Expose component metrics for training plots | JDG 05, JDG 07 | Reward breakdown (A) + logging (C) | 0.5h |
|
| 29 |
|
| 30 |
+
**Total: 5 tasks, 4.0h**
|
| 31 |
|
| 32 |
### What to ask Person A for first (priority order)
|
| 33 |
|
| 34 |
+
1. **MOD 01** (ScientistAction schema) -- unblocks MOD 09 and, after SCN 11, AGT 01
|
| 35 |
+
2. **MOD 03** (Observation models) -- unblocks AGT 02
|
| 36 |
+
3. **SCN 09** (generate_scenario) -- unblocks SCN 11 golden scenarios
|
| 37 |
+
4. **SCN 07 + MOD 05** (constraints + validation) -- unblocks AGT 05, AGT 06, AGT 07
|
| 38 |
+
5. **JDG 05 + JDG 06** (reward breakdown + explanation) -- unblocks AGT 10 and covers only part of the path for JDG 10, which also needs JDG 07 from Person C
|
| 39 |
+
6. **SCN 08** (minimum viable replication spec) -- unblocks AGT 06 after AGT 05
|
| 40 |
|
| 41 |
---
|
| 42 |
|
| 116 |
|
| 117 |
All phases are gated by the listed external dependency being delivered first.
|
| 118 |
|
| 119 |
+
### Phase 1: Available now
|
| 120 |
|
| 121 |
+
1. **FND 08** -- Freeze JSON contract (shared with Person A; unblocked because `FND 04` is complete)
|
| 122 |
|
| 123 |
### Phase 2: After Person A and B complete FND 08, and Person A delivers MOD 01 + SCN 09
|
| 124 |
|
|
| 172 |
|
| 173 |
| Category | Count | Hours |
|
| 174 |
|----------|-------|-------|
|
| 175 |
+
| Currently unblocked | 1 | 0.75h |
|
| 176 |
+
| Blocked by Person A (first-order) | 5 | 4.0h |
|
| 177 |
| Blocked by Person A then Person B chain | 8 | 6.25h |
|
| 178 |
| Blocked by Person C | 3 | 2.5h |
|
| 179 |
| Deep training chain (internal) | 11 | 7.5h |
|
| 182 |
|
| 183 |
---
|
| 184 |
|
| 185 |
+
## 9. Base Model Assumptions
|
| 186 |
+
|
| 187 |
+
### Trainable Scientist policy
|
| 188 |
+
|
| 189 |
+
Primary model: **Qwen3-4B**
|
| 190 |
+
|
| 191 |
+
| Constraint | Qwen3-4B | Qwen3-8B (stretch) |
|
| 192 |
+
|-----------|----------|-------------------|
|
| 193 |
+
| H100 training (BF16, ~3-4x inference mem) | ~14GB weights, ~42-56GB total. Fits 80GB easily | ~19GB weights, ~57-76GB total. Tight |
|
| 194 |
+
| Colab T4 (16GB, 4-bit QLoRA) | 5.5GB. Fits comfortably | 6.5GB. Fits but less headroom |
|
| 195 |
+
| Structured JSON output | Good | Better |
|
| 196 |
+
| RL iteration speed | Fast | Slower |
|
| 197 |
+
|
| 198 |
+
Qwen3-8B is H100-only stretch. Use only if Qwen3-4B quality is insufficient and
|
| 199 |
+
Colab demo uses a reduced-scale fallback.
|
| 200 |
+
|
| 201 |
+
### Reward
|
| 202 |
+
|
| 203 |
+
The training reward is always the **deterministic rubric engine** (E05 in the
|
| 204 |
+
source of truth). A hosted frontier evaluator may optionally be used for
|
| 205 |
+
post-episode explanation and demo audit. The frontier evaluator is never part
|
| 206 |
+
of the training reward loop.
|
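As a sketch of what deterministic means here: same component scores in, same scalar out, with no model call anywhere on the path. The dimension names come from this document; the weights are placeholder assumptions, not the E05 rubric spec.

```python
# Hedged sketch of a deterministic rubric reward. Weights are illustrative
# assumptions; the real E05 rubric engine defines its own components.
RUBRIC_WEIGHTS = {"rigor": 0.4, "feasibility": 0.3, "fidelity": 0.3}

def rubric_reward(component_scores: dict) -> float:
    """Weighted sum of per-dimension scores in [0, 1]; fully reproducible."""
    return round(sum(w * component_scores[name]
                     for name, w in RUBRIC_WEIGHTS.items()), 6)
```

Because the function is a pure weighted sum, rerunning it on logged scores reproduces the training reward exactly, which is what keeps replays auditable.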
| 207 |
+
|
| 208 |
+
### Future model-backed Lab Manager
|
| 209 |
+
|
| 210 |
+
If the Lab Manager later becomes model-backed:
|
| 211 |
+
- The reward formula does not change. The deterministic rubric scores the final
|
| 212 |
+
protocol against ground truth constraints regardless of how the Lab Manager
|
| 213 |
+
generates its responses.
|
| 214 |
+
- Episode variance increases because the same seed may produce different
|
| 215 |
+
negotiation paths, but the scoring dimensions (rigor, feasibility, fidelity)
|
| 216 |
+
remain deterministic.
|
| 217 |
+
- The pragmatic default is same base model (Qwen3-4B) with a separate
|
| 218 |
+
role-specific adapter. One base model in memory, swap adapters per turn.
|
| 219 |
+
- Reward does not split into separate Scientist vs Lab Manager objectives.
|
| 220 |
+
Both roles share the same cooperative reward signal.
|
| 221 |
+
|
| 222 |
+
---
|
| 223 |
+
|
| 224 |
+
## 10. Key Risks for Person B
|
| 225 |
|
| 226 |
| Risk | Impact | Mitigation |
|
| 227 |
|------|--------|------------|
|
| 228 |
| Person A MOD 01-03 delayed | Blocks AGT 01, MOD 09, AGT 02-04 and all downstream | Communicate priority order to Person A early |
|
| 229 |
| Person C API delayed | Blocks entire training pipeline (TRN 01-15) | Coordinate with Person C on API 06 timeline |
|
| 230 |
+
| Qwen3-4B underperforms on structured output | Scientist produces low quality protocols | Fall back to Qwen3-8B on H100, use reduced-scale Colab fallback |
|
| 231 |
| RL training produces flat rewards | No improvement to demo | Have baseline heuristic ready, tune reward weights with Person A |
|
| 232 |
| Scientist produces invalid JSON | Rollout loop crashes | AGT 03 parse plus retry is critical, build it robustly |
|
| 233 |
+
| Future model-backed Lab Manager increases variance | Slower RL convergence | Keep rule-based for MVP training, introduce model-backed only after Scientist policy is stable |
|
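The AGT 03 "parse plus retry" mitigation in the risk table above can be sketched as follows. This is an illustrative outline, not the project's implementation; the regex extraction and the `noop` fallback action name are assumptions.

```python
import json
import re

def parse_action(text: str):
    """Best-effort extraction of one JSON object from raw model output."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

def safe_rollout_action(generate, max_retries: int = 2) -> dict:
    """Retry generation on unparseable output, then fall back instead of crashing."""
    for _ in range(max_retries + 1):
        parsed = parse_action(generate())
        if parsed is not None:
            return parsed
    return {"action": "noop"}  # hypothetical safe fallback action
```

The point is that invalid JSON costs reward (via a wasted turn or a noop) rather than crashing the rollout loop.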
| 234 |
|
| 235 |
---
|
| 236 |
|
| 237 |
+
## 11. Files Person B Owns
|
| 238 |
|
| 239 |
| File | Purpose |
|
| 240 |
|------|---------|
|
docs/ayush/task_list.md
CHANGED
|
@@ -4,6 +4,13 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 4 |
|
| 5 |
---
|
| 6 |
|
| 7 |
## Epic E02. Domain Models
|
| 8 |
|
| 9 |
- [ ] **MOD 09** | Add output parser that maps model text to `ScientistAction` | 0.75h | Depends: MOD 01
|
|
@@ -50,7 +57,7 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 50 |
- [ ] **TRN 09** | Add policy loading path for trained adapter | 0.5h | Depends: TRN 05
|
| 51 |
- [ ] **TRN 10** | Export plot image and sample logs to outputs/plots | 0.25h | Depends: TRN 07
|
| 52 |
- [ ] **TRN 13** | Create reusable environment client module (client.py) | 1h | Depends: API 06
|
| 53 |
-
- [ ] **TRN 14** | Select and document base model (notebook side) | 0.5h | Depends: TRN 01
|
| 54 |
- [ ] **TRN 15** | Add agreement rate and invalid action rate aggregation | 0.5h | Depends: TRN 06, TRN 08, OBS 09
|
| 55 |
|
| 56 |
---
|
|
@@ -69,7 +76,7 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
|
|
| 69 |
|
| 70 |
## Shared Tasks
|
| 71 |
|
| 72 |
-
- [ ] **FND 08** | Freeze JSON contract for actions and observations (with Person A) | 0.75h | Depends: FND 04
|
| 73 |
|
| 74 |
---
|
| 75 |
| 4 |
|
| 5 |
---
|
| 6 |
|
| 7 |
+
## Current status
|
| 8 |
+
|
| 9 |
+
- `FND 04` is complete in `replicalab/models.py`
|
| 10 |
+
- `FND 08` is now the next unblocked Ayush task
|
| 11 |
+
|
| 12 |
+
---
|
| 13 |
+
|
| 14 |
## Epic E02. Domain Models
|
| 15 |
|
| 16 |
- [ ] **MOD 09** | Add output parser that maps model text to `ScientistAction` | 0.75h | Depends: MOD 01
|
| 57 |
- [ ] **TRN 09** | Add policy loading path for trained adapter | 0.5h | Depends: TRN 05
|
| 58 |
- [ ] **TRN 10** | Export plot image and sample logs to outputs/plots | 0.25h | Depends: TRN 07
|
| 59 |
- [ ] **TRN 13** | Create reusable environment client module (client.py) | 1h | Depends: API 06
|
| 60 |
+
- [ ] **TRN 14** | Select and document base model (notebook side) | 0.5h | Depends: TRN 01 | Assumption: Qwen3-4B primary, Qwen3-8B H100-only stretch
|
| 61 |
- [ ] **TRN 15** | Add agreement rate and invalid action rate aggregation | 0.5h | Depends: TRN 06, TRN 08, OBS 09
|
| 62 |
|
| 63 |
---
|
| 76 |
|
| 77 |
## Shared Tasks
|
| 78 |
|
| 79 |
+
- [ ] **FND 08** | Freeze JSON contract for actions and observations (with Person A) | 0.75h | Depends: FND 04 (done) | Status: ready now
|
| 80 |
|
| 81 |
---
|
| 82 |
|