Training Goals
Immediate Goal
Improve the two trainable role models without destabilizing the deterministic reward loop.
The current near-term training target is not "make the models sound better." It is:
- stronger paper understanding
- stronger constraint grounding
- cleaner Scientist-Lab Manager communication
- fewer invalid or hallucinated actions
- more reliable agreement on feasible plans
Role Definitions
Scientist
The Scientist should:
- understand the paper hypothesis, method, key finding, and experiment goal
- propose or revise protocols that stay grounded in the visible brief
- ask blocking questions only when needed
- avoid hallucinating tools, resources, or hidden facts
- converge toward a feasible plan under lab constraints
Lab Manager / Lab Research Assistant
The Lab Manager should:
- enforce budget, time, staffing, equipment, and reagent constraints
- explain feasibility failures clearly
- suggest grounded revisions rather than generic rejections
- keep the collaboration moving toward an executable plan
Judge
The deterministic judge remains the reward source.
Optional large-model judges may be used only for:
- audit text
- post-run error analysis
- qualitative review of failure patterns
They must not replace the deterministic rubric as the training reward.
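The split above can be sketched as follows. This is a minimal illustration, not the environment's actual API: the function names, transcript fields, and scoring weights are all hypothetical, but the structural point is real — the deterministic rubric is the only input to `reward`, and any large-model judge output lands in audit notes only.

```python
from dataclasses import dataclass, field

@dataclass
class RunResult:
    """Outcome of one scored run (hypothetical structure)."""
    reward: float                                     # from the deterministic rubric only
    audit_notes: list = field(default_factory=list)   # optional, never feeds the reward

def deterministic_reward(transcript: dict) -> float:
    """Stand-in for the deterministic rubric: a fixed, reproducible scoring rule."""
    score = 0.0
    if transcript.get("agreement_reached"):
        score += 0.5
    score -= 0.1 * transcript.get("invalid_actions", 0)
    return max(0.0, min(1.0, score))

def score_run(transcript: dict, audit_judge=None) -> RunResult:
    """Score a run; an optional audit judge is recorded but never mixed into reward."""
    result = RunResult(reward=deterministic_reward(transcript))
    if audit_judge is not None:
        result.audit_notes.append(audit_judge(transcript))
    return result
```

Keeping the audit path structurally separate like this makes the "must not replace" guardrail checkable: reward is identical whether or not an audit judge is attached.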
Core Metrics To Track Every Run
- average reward
- agreement rate
- invalid action rate
- rigor
- feasibility
- fidelity
- paper understanding
- communication quality
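One way to keep these metrics comparable across before/after graphs is to fix a per-run record with exactly these fields. A minimal sketch, assuming a flat numeric schema (the class and field names are illustrative, not an existing interface):

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    """Per-run metrics record; one field per tracked metric (hypothetical schema)."""
    average_reward: float
    agreement_rate: float
    invalid_action_rate: float
    rigor: float
    feasibility: float
    fidelity: float
    paper_understanding: float
    communication_quality: float

    def as_row(self) -> dict:
        """Flatten to a dict so every run logs the same columns."""
        return dict(self.__dict__)
```

Because the schema is frozen in one place, a run that silently drops or renames a metric fails at construction time rather than producing incomparable graphs.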
Data Expansion Direction
The Scientist dataset should keep growing along three prompt goals:
- paper_understanding
- constraint_grounding
- negotiation_quality
This expands coverage without changing the outer environment contract.
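A small validation gate can enforce that every new dataset row targets one of the three goals, so expansion stays inside the contract. This is a sketch; the `prompt_goal` field name is an assumption about the dataset schema:

```python
# The three prompt goals named above; anything else is a coverage error.
PROMPT_GOALS = {"paper_understanding", "constraint_grounding", "negotiation_quality"}

def validate_example(example: dict) -> dict:
    """Reject dataset rows whose prompt_goal is outside the tracked goals."""
    goal = example.get("prompt_goal")
    if goal not in PROMPT_GOALS:
        raise ValueError(f"unknown prompt_goal: {goal!r}")
    return example
```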
Current Model Mapping
- Scientist: Qwen/Qwen3.5-9B
- Lab Manager: Qwen/Qwen3.5-9B
- Fallback: Qwen/Qwen3.5-4B
- Audit-only judge candidate: Qwen/Qwen3.5-122B-A10B
Architecture Note: Execution Environment
The current environment mainly judges collaborative planning quality.
The proposed next phase is larger:
- the Lab Manager allocates and configures experimental resources
- the Scientist performs bounded experimental steps inside the environment
- the judge scores not only negotiation quality but experimental execution quality and error recovery
- paper replication is judged by reproducing the logic and outcome of the experiment, not by line-by-line paraphrase of the source paper
This is an environment redesign; implement it as a separate phase rather than silently folding it into training-metric changes.
Guardrails
- no hallucinated resources, tools, measurements, or outcomes
- no hidden ground-truth leakage into model prompts
- no live-web reward dependence
- deterministic reward remains the training source of truth
- before/after graphs must stay comparable across runs
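The leakage guardrail in particular lends itself to an automated check before prompts reach the models. A minimal sketch covering only that one guardrail; the marker list is hypothetical and would hold strings that exist solely in the hidden ground truth (answer keys, judge rubric text):

```python
def check_prompt_guardrails(prompt: str, ground_truth_markers: list) -> list:
    """Return guardrail violations found in a model-visible prompt.

    Only checks the no-leakage guardrail: none of the hidden ground-truth
    marker strings may appear in what the models see.
    """
    violations = []
    for marker in ground_truth_markers:
        if marker in prompt:
            violations.append(f"ground-truth leakage: {marker!r}")
    return violations
```

Running this on every assembled prompt, and failing the run on a non-empty result, turns the guardrail from a convention into an enforced invariant.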