Training Goals
Immediate Goal
Improve the two trainable role models without destabilizing the deterministic reward loop.
The current near-term training target is not "make the models sound better." It is:
- stronger paper understanding
- stronger constraint grounding
- cleaner Scientist-Lab Manager communication
- fewer invalid or hallucinated actions
- more reliable agreement on feasible plans
Role Definitions
Scientist
The Scientist should:
- understand the paper hypothesis, method, key finding, and experiment goal
- propose or revise protocols that stay grounded in the visible brief
- ask blocking questions only when needed
- avoid hallucinating tools, resources, or hidden facts
- converge toward a feasible plan under lab constraints
Lab Manager / Lab Research Assistant
The Lab Manager should:
- enforce budget, time, staffing, equipment, and reagent constraints
- explain feasibility failures clearly
- suggest grounded revisions rather than generic rejections
- keep the collaboration moving toward an executable plan
Judge
The deterministic judge remains the reward source.
Optional large-model judges may be used only for:
- audit text
- post-run error analysis
- qualitative review of failure patterns
They must not replace the deterministic rubric as the training reward.
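The split above can be sketched as follows. This is a minimal illustration, not the environment's actual API: the function names, transcript fields, and scoring weights are all hypothetical, but the structural point is real — the deterministic rubric is the only input to `reward`, and any large-model judge output lands in audit notes only.

```python
from dataclasses import dataclass, field

@dataclass
class RunResult:
    """Outcome of one scored run (hypothetical structure)."""
    reward: float                                     # from the deterministic rubric only
    audit_notes: list = field(default_factory=list)   # optional, never feeds the reward

def deterministic_reward(transcript: dict) -> float:
    """Stand-in for the deterministic rubric: a fixed, reproducible scoring rule."""
    score = 0.0
    if transcript.get("agreement_reached"):
        score += 0.5
    score -= 0.1 * transcript.get("invalid_actions", 0)
    return max(0.0, min(1.0, score))

def score_run(transcript: dict, audit_judge=None) -> RunResult:
    """Score a run; an optional audit judge is recorded but never mixed into reward."""
    result = RunResult(reward=deterministic_reward(transcript))
    if audit_judge is not None:
        result.audit_notes.append(audit_judge(transcript))
    return result
```

Keeping the audit path structurally separate like this makes the "must not replace" guardrail checkable: reward is identical whether or not an audit judge is attached.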
Core Metrics To Track Every Run
- average reward
- agreement rate
- invalid action rate
- rigor
- feasibility
- fidelity
- paper understanding
- communication quality
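One way to keep these metrics comparable across before/after graphs is to fix a per-run record with exactly these fields. A minimal sketch, assuming a flat numeric schema (the class and field names are illustrative, not an existing interface):

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    """Per-run metrics record; one field per tracked metric (hypothetical schema)."""
    average_reward: float
    agreement_rate: float
    invalid_action_rate: float
    rigor: float
    feasibility: float
    fidelity: float
    paper_understanding: float
    communication_quality: float

    def as_row(self) -> dict:
        """Flatten to a dict so every run logs the same columns."""
        return dict(self.__dict__)
```

Because the schema is frozen in one place, a run that silently drops or renames a metric fails at construction time rather than producing incomparable graphs.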
Data Expansion Direction
The Scientist dataset should keep growing along three prompt goals:
- paper_understanding
- constraint_grounding
- negotiation_quality
This expands coverage without changing the outer environment contract.
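A small validation gate can enforce that every new dataset row targets one of the three goals, so expansion stays inside the contract. This is a sketch; the `prompt_goal` field name is an assumption about the dataset schema:

```python
# The three prompt goals named above; anything else is a coverage error.
PROMPT_GOALS = {"paper_understanding", "constraint_grounding", "negotiation_quality"}

def validate_example(example: dict) -> dict:
    """Reject dataset rows whose prompt_goal is outside the tracked goals."""
    goal = example.get("prompt_goal")
    if goal not in PROMPT_GOALS:
        raise ValueError(f"unknown prompt_goal: {goal!r}")
    return example
```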
Current Model Mapping
- Scientist: Qwen/Qwen3.5-9B
- Lab Manager: Qwen/Qwen3.5-9B
- Fallback: Qwen/Qwen3.5-4B
- Audit-only judge candidate: Qwen/Qwen3.5-122B-A10B
Architecture Note: Execution Environment
The current environment mainly judges collaborative planning quality.
The proposed next phase is larger:
- the Lab Manager allocates and configures experimental resources
- the Scientist performs bounded experimental steps inside the environment
- the judge scores not only negotiation quality but experimental execution quality and error recovery
- paper replication is judged by reproducing the logic and outcome of the experiment, not by line-by-line paraphrase of the source paper
This is an environment redesign; implement it as a separate phase rather than silently folding it into training-metric changes.
Guardrails
- no hallucinated resources, tools, measurements, or outcomes
- no hidden ground-truth leakage into model prompts
- no live-web reward dependence
- deterministic reward remains the training source of truth
- before/after graphs must stay comparable across runs
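The leakage guardrail in particular lends itself to an automated check before prompts reach the models. A minimal sketch covering only that one guardrail; the marker list is hypothetical and would hold strings that exist solely in the hidden ground truth (answer keys, judge rubric text):

```python
def check_prompt_guardrails(prompt: str, ground_truth_markers: list) -> list:
    """Return guardrail violations found in a model-visible prompt.

    Only checks the no-leakage guardrail: none of the hidden ground-truth
    marker strings may appear in what the models see.
    """
    violations = []
    for marker in ground_truth_markers:
        if marker in prompt:
            violations.append(f"ground-truth leakage: {marker!r}")
    return violations
```

Running this on every assembled prompt, and failing the run on a non-empty result, turns the guardrail from a convention into an enforced invariant.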