AGT 11 Scientist Model Selection
Decision
The primary Northflank and local training base for both role adapters is now Qwen/Qwen3.5-9B.
The reduced-scale fallback is Qwen/Qwen3.5-4B for lower-memory smoke runs, faster iteration, and notebook fallback paths.
The optional audit-only judge model candidate is Qwen/Qwen3.5-122B-A10B. It is not part of the deterministic reward loop.
Role Mapping
- Scientist: Qwen/Qwen3.5-9B + Unsloth GRPO LoRA
- Lab Manager / Lab Research Assistant: Qwen/Qwen3.5-9B + Unsloth SFT LoRA
- Fallback Scientist or Lab Manager: Qwen/Qwen3.5-4B
- Audit-only judge candidate: Qwen/Qwen3.5-122B-A10B
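The mapping above could be captured in a small lookup so that training scripts resolve the base model in one place. This is a minimal sketch, not the project's actual config: the names `RoleConfig`, `ROLES`, and `resolve_role` and the `low_memory` flag are hypothetical, and it assumes the fallback keeps each role's training method, which the mapping does not state.

```python
from dataclasses import dataclass

# Model IDs taken from the role mapping above.
PRIMARY_BASE = "Qwen/Qwen3.5-9B"
FALLBACK_BASE = "Qwen/Qwen3.5-4B"
AUDIT_JUDGE = "Qwen/Qwen3.5-122B-A10B"  # audit only, never used for reward


@dataclass(frozen=True)
class RoleConfig:
    base_model: str
    method: str  # "grpo" or "sft"


# Hypothetical role keys for the two trainable roles.
ROLES = {
    "scientist": RoleConfig(PRIMARY_BASE, "grpo"),
    "lab_manager": RoleConfig(PRIMARY_BASE, "sft"),
}


def resolve_role(role: str, low_memory: bool = False) -> RoleConfig:
    """Return the config for a role, swapping in the 4B base when memory is tight.

    Assumption: the fallback reuses the role's training method.
    """
    config = ROLES[role]
    if low_memory:
        return RoleConfig(FALLBACK_BASE, config.method)
    return config
```

For example, `resolve_role("lab_manager", low_memory=True)` would select the 4B base for a notebook smoke run while leaving the primary mapping untouched.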
Why Qwen3.5-9B For The Two Trainable Roles
- It is a cleaner fit for the current Northflank H100 path than the older Qwen3-8B baseline and keeps both trainable roles on one family.
- It preserves enough planning headroom for strict JSON action output, paper-grounded reasoning, and negotiation under constraints.
- It still leaves a realistic fallback to the 4B variant when the team wants faster notebook iteration.
Why Keep The Judge Deterministic
- The reward source must stay reproducible across runs.
- A large model judge is useful for audits, narrative analysis, and post-run error review, but not for the scalar training reward.
- This keeps benchmark history and before/after graphs comparable across runs.
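To illustrate what a deterministic reward source means in practice, here is a minimal sketch of a rule-based scorer that depends only on the completion text, so the same input always produces the same scalar. The function name `deterministic_reward` and the `REQUIRED_KEYS` schema are hypothetical; the project's real reward rules are not described here and are likely richer.

```python
import json

# Hypothetical action schema; the real required fields are an assumption.
REQUIRED_KEYS = ("action", "arguments")


def deterministic_reward(completion: str) -> float:
    """Score one completion with fixed rules only: no model call, no randomness."""
    try:
        parsed = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0  # not valid JSON, so the strict-output rule fails
    if not isinstance(parsed, dict):
        return 0.0  # valid JSON, but not an action object
    # Fraction of the required keys that are present in the action object.
    present = sum(1 for key in REQUIRED_KEYS if key in parsed)
    return present / len(REQUIRED_KEYS)
```

Because the score is a pure function of the text, rerunning an old checkpoint's outputs through it reproduces the old numbers, which is what keeps before/after graphs comparable. A large judge model can still annotate the same outputs afterwards without feeding back into training.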
Current Training Priorities
- Measure paper understanding explicitly on every evaluation run.
- Expand Scientist prompt coverage around paper understanding, constraint grounding, and negotiation quality.
- Keep cumulative benchmark graphs updating across runs instead of only saving one-off plots.
- Treat the execution-style lab environment as the next architecture phase, not as an untracked reward change.
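One way to keep benchmark graphs cumulative is to append each evaluation run to a shared history file and always plot from the full history. The sketch below assumes a JSON Lines file; the helper names `append_run` and `load_history` and metric names such as `paper_understanding` are hypothetical placeholders.

```python
import json
from pathlib import Path


def append_run(history_path: Path, run_id: str, metrics: dict) -> None:
    """Append one evaluation run as a single JSON line; earlier runs are kept."""
    record = {"run_id": run_id, **metrics}
    with history_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")


def load_history(history_path: Path) -> list:
    """Load every recorded run in file order; returns [] if no history exists yet."""
    if not history_path.exists():
        return []
    with history_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

A plotting step would then read `load_history(...)` and redraw the whole curve after each run, rather than saving a separate plot per run.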