replicalab / docs /agt11_scientist_model_selection.md

AGT 11 Scientist Model Selection

Decision

The primary base model for both role adapters, on Northflank and in local training, is now Qwen/Qwen3.5-9B.

The reduced-scale fallback is Qwen/Qwen3.5-4B for lower-memory smoke runs, faster iteration, and notebook fallback paths.

The optional audit-only judge model candidate is Qwen/Qwen3.5-122B-A10B. It is not part of the deterministic reward loop.

Role Mapping

Scientist: Qwen/Qwen3.5-9B + Unsloth GRPO LoRA
Lab Manager / Lab Research Assistant: Qwen/Qwen3.5-9B + Unsloth SFT LoRA
Fallback Scientist or Lab Manager: Qwen/Qwen3.5-4B
Audit-only judge candidate: Qwen/Qwen3.5-122B-A10B
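
The mapping above could be captured in a small config table so that training scripts resolve a role to its base model in one place. This is a minimal sketch, not the project's actual config: the names `RoleConfig`, `ROLES`, and `resolve`, the `reduced_scale` flag, and the choice to keep the same training method on the 4B fallback are all assumptions made for illustration.

```python
from dataclasses import dataclass

PRIMARY_BASE = "Qwen/Qwen3.5-9B"
FALLBACK_BASE = "Qwen/Qwen3.5-4B"
AUDIT_JUDGE = "Qwen/Qwen3.5-122B-A10B"  # audit-only, never used for reward


@dataclass(frozen=True)
class RoleConfig:
    """Hypothetical per-role training config."""
    base_model: str
    method: str  # "grpo" or "sft"; both produce a LoRA adapter


# Role names are illustrative keys, not identifiers from the codebase.
ROLES = {
    "scientist": RoleConfig(PRIMARY_BASE, "grpo"),
    "lab_manager": RoleConfig(PRIMARY_BASE, "sft"),
}


def resolve(role: str, reduced_scale: bool = False) -> RoleConfig:
    """Return the config for a role, swapping in the 4B base on request.

    Assumption: the fallback keeps the role's training method and only
    changes the base model.
    """
    cfg = ROLES[role]
    if reduced_scale:
        return RoleConfig(FALLBACK_BASE, cfg.method)
    return cfg
```

A smoke run would then call `resolve("scientist", reduced_scale=True)` and get the 4B base, while the default call stays on the 9B base.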

Why Qwen3.5-9B For The Two Trainable Roles

It is a cleaner fit for the current Northflank H100 path than the older Qwen3-8B baseline and keeps both trainable roles on one family.
It preserves enough planning headroom for strict JSON action output, paper-grounded reasoning, and negotiation under constraints.
It still leaves a realistic fallback to the 4B variant when the team wants faster notebook iteration.

Why Keep The Judge Deterministic

The reward source must stay reproducible across runs.
A large model judge is useful for audits, narrative analysis, and post-run error review, but not for the scalar training reward.
This keeps benchmark history and before/after graphs comparable across runs.
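
One way to picture the separation described above is a run record where the scalar reward comes only from a pure, rule-based function and any judge output sits in a separate audit field. The sketch below is illustrative only: the reward rule (fraction of required JSON keys present), the key names, and the functions `deterministic_reward` and `build_run_record` are hypothetical and do not reflect the project's actual reward.

```python
import json


def deterministic_reward(action_text, required_keys=("action", "rationale")):
    """Rule-based scalar reward: the same input always yields the same score.

    Hypothetical rule: fraction of required keys present in a JSON object.
    """
    try:
        action = json.loads(action_text)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(action, dict):
        return 0.0
    present = sum(1 for key in required_keys if key in action)
    return present / len(required_keys)


def build_run_record(action_text, judge_notes=None):
    """Keep judge output beside the reward, never inside it."""
    return {
        "reward": deterministic_reward(action_text),
        "audit": {"judge_notes": judge_notes},
    }
```

Because `judge_notes` never feeds into `reward`, adding or removing the audit pass leaves the training signal unchanged.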

Current Training Priorities

Measure paper understanding explicitly on every evaluation run.
Expand Scientist prompt coverage around paper understanding, constraint grounding, and negotiation quality.
Keep cumulative benchmark graphs updating across runs instead of only saving one-off plots.
Treat the execution-style lab environment as the next architecture phase, not as an untracked reward change.
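
For the cumulative-graph priority above, a simple approach is an append-only history file that every evaluation run writes to, so plots can be rebuilt from the full history. This is a sketch under assumptions: the JSON Lines format, the function names `append_run` and `load_history`, and the metric name `paper_understanding` in the usage note are hypothetical, not taken from the repository.

```python
import json
from pathlib import Path


def append_run(history_path, run_id, metrics):
    """Append one evaluation run as a JSON line; earlier runs are kept."""
    path = Path(history_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"run_id": run_id, **metrics}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def load_history(history_path):
    """Return all recorded runs in the order they were appended."""
    path = Path(history_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Each evaluation would call something like `append_run("benchmarks/history.jsonl", "run-001", {"paper_understanding": 0.4})`, and the plotting step would read `load_history(...)` instead of a single run's output.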