# Training Goals

## Immediate Goal

Improve the two trainable role models without destabilizing the deterministic reward loop.

The current near-term training target is not "make the models sound better." It is:

1. stronger paper understanding
2. stronger constraint grounding
3. cleaner Scientist-Lab Manager communication
4. fewer invalid or hallucinated actions
5. more reliable agreement on feasible plans

## Role Definitions

### Scientist

The Scientist should:

1. understand the paper hypothesis, method, key finding, and experiment goal
2. propose or revise protocols that stay grounded in the visible brief
3. ask blocking questions only when needed
4. avoid hallucinating tools, resources, or hidden facts
5. converge toward a feasible plan under lab constraints

### Lab Manager / Lab Research Assistant

The Lab Manager should:

1. enforce budget, time, staffing, equipment, and reagent constraints
2. explain feasibility failures clearly
3. suggest grounded revisions rather than generic rejections
4. keep the collaboration moving toward an executable plan

### Judge

The deterministic judge remains the reward source.

Optional large-model judges may be used only for:

1. audit text
2. post-run error analysis
3. qualitative review of failure patterns

They must not replace the deterministic rubric as the training reward.

## Core Metrics To Track Every Run

1. average reward
2. agreement rate
3. invalid action rate
4. rigor
5. feasibility
6. fidelity
7. paper understanding
8. communication quality

## Data Expansion Direction

The Scientist dataset should keep growing along three prompt goals:

1. `paper_understanding`
2. `constraint_grounding`
3. `negotiation_quality`

This expands coverage without changing the outer environment contract.

## Current Model Mapping

1. Scientist: `Qwen/Qwen3.5-9B`
2. Lab Manager: `Qwen/Qwen3.5-9B`
3. Fallback: `Qwen/Qwen3.5-4B`
4. Audit-only judge candidate: `Qwen/Qwen3.5-122B-A10B`

## Architecture Note: Execution Environment

The current environment mainly judges collaborative planning quality.

The proposed next phase is larger:

1. the Lab Manager allocates and configures experimental resources
2. the Scientist performs bounded experimental steps inside the environment
3. the judge scores not only negotiation quality but experimental execution quality and error recovery
4. paper replication is judged by reproducing the logic and outcome of the experiment, not by line-by-line paraphrase of the source paper

This is an environment redesign and should be implemented as a separate phase, not mixed into training-metric changes silently.

## Guardrails

1. no hallucinated resources, tools, measurements, or outcomes
2. no hidden ground-truth leakage into model prompts
3. no live-web reward dependence
4. deterministic reward remains the training source of truth
5. before/after graphs must stay comparable across runs
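
As a rough illustration of how guardrails 4 and 5 might be checked in practice, the sketch below records the eight core metrics for a run and compares two runs over the same fixed metric set. It is not the project's actual tooling: `RunMetrics`, `check_guardrails`, `compare_runs`, the snake_case field names, and the `"deterministic_rubric"` label are all hypothetical names chosen for this example.

```python
from __future__ import annotations

import math
from dataclasses import dataclass

# Hypothetical field names mirroring the "Core Metrics To Track Every Run" list.
CORE_METRICS = (
    "average_reward",
    "agreement_rate",
    "invalid_action_rate",
    "rigor",
    "feasibility",
    "fidelity",
    "paper_understanding",
    "communication_quality",
)

# Assumed label for the deterministic judge; the real identifier may differ.
DETERMINISTIC_SOURCE = "deterministic_rubric"


@dataclass(frozen=True)
class RunMetrics:
    """Summary of one run. Every core metric is a required field."""

    run_id: str
    reward_source: str
    average_reward: float
    agreement_rate: float
    invalid_action_rate: float
    rigor: float
    feasibility: float
    fidelity: float
    paper_understanding: float
    communication_quality: float


def check_guardrails(run: RunMetrics) -> list[str]:
    """Return guardrail violations for a run; an empty list means it passes."""
    problems = []
    # Guardrail 4: the training reward must come from the deterministic judge.
    if run.reward_source != DETERMINISTIC_SOURCE:
        problems.append(f"{run.run_id}: reward source is {run.reward_source!r}")
    # Guardrail 5: a NaN metric would make before/after graphs incomparable.
    for name in CORE_METRICS:
        if math.isnan(getattr(run, name)):
            problems.append(f"{run.run_id}: metric {name} is NaN")
    return problems


def compare_runs(before: RunMetrics, after: RunMetrics) -> dict[str, float]:
    """Per-metric delta (after minus before) over the same fixed metric set."""
    return {
        name: getattr(after, name) - getattr(before, name)
        for name in CORE_METRICS
    }
```

Making every metric a required field means a run that omits one fails at construction time, which is one simple way to keep before/after comparisons aligned. A caller would build one `RunMetrics` per run, reject any run for which `check_guardrails` returns a non-empty list, and plot the output of `compare_runs`.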