# Training Goals

## Immediate Goal

Improve the two trainable role models without destabilizing the deterministic reward loop.

The current near-term training target is not "make the models sound better." It is:

1. stronger paper understanding
2. stronger constraint grounding
3. cleaner Scientist-Lab Manager communication
4. fewer invalid or hallucinated actions
5. more reliable agreement on feasible plans
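One way to keep these five targets concrete is to treat them as weighted components of a single scalar reward. The sketch below is illustrative only: the component names and weights are assumptions, not the actual rubric.

```python
# Hypothetical sketch: the five near-term targets as weighted components
# of one scalar training reward. Weights are illustrative, not the rubric.
TARGET_WEIGHTS = {
    "paper_understanding": 0.25,
    "constraint_grounding": 0.25,
    "communication": 0.20,
    "action_validity": 0.15,  # penalizes invalid or hallucinated actions
    "agreement": 0.15,        # reliable convergence on feasible plans
}

def composite_reward(scores: dict[str, float]) -> float:
    """Combine per-target scores (each in [0, 1]) into one scalar."""
    return sum(TARGET_WEIGHTS[k] * scores[k] for k in TARGET_WEIGHTS)
```

Keeping the components explicit also makes it easy to see which target a regression came from, rather than watching only the aggregate.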
## Role Definitions

### Scientist

The Scientist should:

1. understand the paper hypothesis, method, key finding, and experiment goal
2. propose or revise protocols that stay grounded in the visible brief
3. ask blocking questions only when needed
4. avoid hallucinating tools, resources, or hidden facts
5. converge toward a feasible plan under lab constraints

### Lab Manager / Lab Research Assistant

The Lab Manager should:

1. enforce budget, time, staffing, equipment, and reagent constraints
2. explain feasibility failures clearly
3. suggest grounded revisions rather than generic rejections
4. keep the collaboration moving toward an executable plan
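The two roles above can be enforced at the transcript level with a small typed turn record. This is a minimal sketch under assumed names (`Turn`, `blocking_question`), not the real transcript schema:

```python
# Illustrative turn record: only the two trainable roles emit messages,
# and only the Scientist may flag a turn as a blocking question.
from dataclasses import dataclass

ROLES = {"scientist", "lab_manager"}

@dataclass
class Turn:
    role: str
    content: str
    blocking_question: bool = False

    def __post_init__(self):
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role}")
        if self.blocking_question and self.role != "scientist":
            raise ValueError("only the Scientist asks blocking questions")
```

A check like this catches role confusion at logging time instead of letting it surface as a judge-score anomaly later.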
### Judge

The deterministic judge remains the reward source. Optional large-model judges may be used only for:

1. audit text
2. post-run error analysis
3. qualitative review of failure patterns

They must not replace the deterministic rubric as the training reward.
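The split can be made structural in code: the deterministic rubric is the only function whose return value ever becomes a reward, while the LLM judge returns annotations with no reward field. The function names and the placeholder rubric rule below are assumptions for illustration:

```python
# Sketch of the judge split: deterministic rubric = reward source;
# large-model judge = post-run audit only, never a reward.
def rubric_score(transcript: list[str]) -> float:
    """Deterministic rubric (placeholder rule: an agreed-plan marker)."""
    return 1.0 if any("AGREED_PLAN" in turn for turn in transcript) else 0.0

def train_reward(transcript: list[str]) -> float:
    """The only path to a training reward; never calls an LLM judge."""
    return rubric_score(transcript)

def llm_audit(transcript: list[str]) -> dict:
    """Post-run only: qualitative notes, explicitly carrying no reward."""
    return {"notes": "qualitative failure-pattern review goes here",
            "reward": None}
```

Keeping `reward: None` in the audit output makes the guardrail self-documenting: nothing downstream can accidentally treat audit text as a training signal.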
## Core Metrics To Track Every Run

1. average reward
2. agreement rate
3. invalid action rate
4. rigor
5. feasibility
6. fidelity
7. paper understanding
8. communication quality
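A fixed per-run record keeps these eight quantities comparable across runs. The field names below are assumptions, not the actual logging schema:

```python
# Hedged sketch of a per-run metrics record covering the eight tracked
# quantities; flattening to a dict gives one row per run for comparison.
from dataclasses import dataclass, asdict

@dataclass
class RunMetrics:
    avg_reward: float
    agreement_rate: float
    invalid_action_rate: float
    rigor: float
    feasibility: float
    fidelity: float
    paper_understanding: float
    communication_quality: float

def to_log_row(metrics: RunMetrics) -> dict:
    """One flat row, suitable for a before/after comparison table."""
    return asdict(metrics)
```

Using a dataclass rather than an ad-hoc dict means a run that forgets to log one of the eight metrics fails loudly at construction time.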
## Data Expansion Direction

The Scientist dataset should keep growing along three prompt goals:

1. `paper_understanding`
2. `constraint_grounding`
3. `negotiation_quality`

This expands coverage without changing the outer environment contract.
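A cheap way to keep new entries inside the contract is a validation gate on the goal label. The entry shape (`prompt_goal` key) is an assumption for illustration:

```python
# Sketch of a dataset-entry check: new Scientist prompts must carry one
# of the three allowed prompt goals, preserving the environment contract.
PROMPT_GOALS = {"paper_understanding", "constraint_grounding",
                "negotiation_quality"}

def validate_entry(entry: dict) -> dict:
    """Reject entries whose goal label falls outside the three goals."""
    if entry.get("prompt_goal") not in PROMPT_GOALS:
        raise ValueError(
            f"prompt_goal must be one of {sorted(PROMPT_GOALS)}")
    return entry
```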
## Current Model Mapping

1. Scientist: `Qwen/Qwen3.5-9B`
2. Lab Manager: `Qwen/Qwen3.5-9B`
3. Fallback: `Qwen/Qwen3.5-4B`
4. Audit-only judge candidate: `Qwen/Qwen3.5-122B-A10B`
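The mapping above can be carried as a single config dict with an explicit fallback path. This is a sketch only; the real config format is not shown in this document:

```python
# The current model mapping as a config dict; unmapped roles resolve to
# the fallback model. The dict shape is an assumption, the ids are not.
MODEL_MAP = {
    "scientist": "Qwen/Qwen3.5-9B",
    "lab_manager": "Qwen/Qwen3.5-9B",
    "fallback": "Qwen/Qwen3.5-4B",
    "audit_judge": "Qwen/Qwen3.5-122B-A10B",  # audit only, never reward
}

def model_for(role: str) -> str:
    """Resolve a role to its model id, falling back when unmapped."""
    return MODEL_MAP.get(role, MODEL_MAP["fallback"])
```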
## Architecture Note: Execution Environment

The current environment mainly judges collaborative planning quality. The proposed next phase is larger:

1. the Lab Manager allocates and configures experimental resources
2. the Scientist performs bounded experimental steps inside the environment
3. the judge scores not only negotiation quality but also experimental execution quality and error recovery
4. paper replication is judged by reproducing the logic and outcome of the experiment, not by line-by-line paraphrase of the source paper

This is an environment redesign and should be implemented as a separate phase, not silently mixed into training-metric changes.
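One way to keep the redesign separable is to give the two phases a shared scoring interface but distinct classes, so execution-phase metrics are an additive extension rather than an in-place change. The class and key names below are hypothetical:

```python
# Illustrative phase separation: the execution environment extends the
# planning environment's scores without altering them, so before/after
# planning metrics stay comparable across the redesign.
class PlanningEnv:
    """Current phase: judges collaborative planning quality only."""
    phase = "planning"

    def score(self, transcript: list) -> dict:
        return {"negotiation_quality": 0.0}  # placeholder rubric call

class ExecutionEnv(PlanningEnv):
    """Proposed phase: adds execution-quality and error-recovery scores."""
    phase = "execution"

    def score(self, transcript: list) -> dict:
        scores = super().score(transcript)
        scores.update({"execution_quality": 0.0, "error_recovery": 0.0})
        return scores
```

Because `ExecutionEnv` only adds keys, any dashboard built on planning-phase scores keeps working unchanged when the new phase lands.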
## Guardrails

1. no hallucinated resources, tools, measurements, or outcomes
2. no hidden ground-truth leakage into model prompts
3. no live-web reward dependence
4. deterministic reward remains the training source of truth
5. before/after graphs must stay comparable across runs
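Guardrails 2 through 4 are mechanical enough to check automatically before a run starts. The config keys below are hypothetical, chosen only to show the shape of such a check:

```python
# Sketch: guardrails 2-4 as cheap pre-run checks on a run config.
# Config key names are assumptions for illustration.
def check_guardrails(cfg: dict) -> list[str]:
    """Return a list of guardrail violations; empty means the run may start."""
    violations = []
    if cfg.get("leak_ground_truth_to_prompts", False):
        violations.append("hidden ground truth leaked into model prompts")
    if cfg.get("reward_uses_live_web", False):
        violations.append("reward depends on live web access")
    if cfg.get("reward_source") != "deterministic_rubric":
        violations.append("training reward is not the deterministic rubric")
    return violations
```

Guardrails 1 and 5 are harder to automate (hallucination detection and graph comparability are judgment calls), so they stay as review items rather than config checks.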