	# Future Improvements

	Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`

	This document tracks post-MVP architectural improvements. Work here begins
	only after the core logic is complete and the hackathon deliverables are
	stable.

	---

	## 1. Domain-Agnostic Normalized Scenario Layer

	### Priority: highest future feature

	### Problem

	The current models in `replicalab/models.py` use domain-biased field names:

	- `paper_title`, `paper_hypothesis`, `paper_method`, `paper_key_finding`
	- `equipment_available`, `reagents_in_stock`, `staff_count`
	- `sample_size`, `controls`, `technique`

	These work for the three MVP scenario families (cell biology, ML benchmark,
	behavioral psychology) because all three map onto a lab-style replication
	frame. But if the environment needs to support domains outside scientific
	replication (e.g., engineering design, clinical trial planning, supply chain
	optimization), the field names stop making sense.

	The turn protocol itself (`propose`, `revise`, `request_info`, `accept`) is
	already generic. The gap is in the observation and protocol content layer.

	### Solution: normalized scenario representation

	Introduce a structured internal representation that any domain adapter can
	emit:

	```python
	from pydantic import BaseModel

	# Constraint, Resource, and Substitution are defined in the next block.


	class NormalizedScenarioPack(BaseModel):
	    domain_id: str  # "cell_biology", "ml_benchmark", etc.
	    task_summary: str  # what the agent is trying to achieve
	    success_criteria: list[str]  # measurable conditions for success
	    constraints: list[Constraint]  # budget, time, equipment, policy, etc.
	    resources: list[Resource]  # what is available to work with
	    allowed_substitutions: list[Substitution]  # valid swaps the agent can propose
	    hidden_reference_spec: dict  # ground truth the judge scores against
	    difficulty: str  # "easy", "medium", "hard"
	    metadata: dict  # domain-specific extras
	```

	Where:

	```python
	from typing import Any, Optional

	from pydantic import BaseModel


	class Constraint(BaseModel):
	    dimension: str  # "budget", "time", "equipment", "personnel", "safety"
	    label: str  # human-readable name
	    value: Any  # the constraint value (numeric, list, etc.)
	    hard: bool = True  # hard constraint vs soft preference


	class Resource(BaseModel):
	    category: str  # "equipment", "reagent", "compute", "personnel"
	    name: str  # resource identifier
	    available: bool  # currently available
	    quantity: Optional[int] = None  # count if applicable; None when not countable
	    notes: str = ""  # booking conflicts, expiry, etc.


	class Substitution(BaseModel):
	    original: str  # what the reference spec uses
	    replacement: str  # what the agent can use instead
	    quality_impact: float  # 0.0 to 1.0, how much fidelity is lost
	    cost_delta: float  # cost difference
	```
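
As a rough illustration of how these models fit together, the sketch below builds one small pack and reads it back. It is only a sketch under stated assumptions: plain dataclasses stand in for the Pydantic models so it runs on the standard library alone, and every value (budget, equipment names, reference spec keys) is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Constraint:
    dimension: str
    label: str
    value: Any
    hard: bool = True


@dataclass
class Resource:
    category: str
    name: str
    available: bool
    quantity: Optional[int] = None
    notes: str = ""


@dataclass
class Substitution:
    original: str
    replacement: str
    quality_impact: float
    cost_delta: float


@dataclass
class NormalizedScenarioPack:
    domain_id: str
    task_summary: str
    success_criteria: list[str]
    constraints: list[Constraint]
    resources: list[Resource]
    allowed_substitutions: list[Substitution]
    hidden_reference_spec: dict
    difficulty: str
    metadata: dict = field(default_factory=dict)


# Illustrative values only; none of these come from a real scenario file.
pack = NormalizedScenarioPack(
    domain_id="cell_biology",
    task_summary="Replicate a cell viability assay under a reduced budget",
    success_criteria=["Include a vehicle control", "At least 3 replicates"],
    constraints=[
        Constraint(dimension="budget", label="Total budget (USD)", value=5000),
        Constraint(dimension="time", label="Preferred deadline (days)", value=14, hard=False),
    ],
    resources=[
        Resource(category="equipment", name="plate_reader", available=True, quantity=1),
        Resource(category="equipment", name="confocal_microscope", available=False,
                 notes="booked for the next two weeks"),
    ],
    allowed_substitutions=[
        Substitution(original="confocal_microscope", replacement="widefield_microscope",
                     quality_impact=0.2, cost_delta=-300.0),
    ],
    hidden_reference_spec={"technique": "MTT assay", "required_controls": ["vehicle"]},
    difficulty="medium",
)

# Downstream components only need generic queries like these.
hard_labels = [c.label for c in pack.constraints if c.hard]
usable = [r.name for r in pack.resources if r.available]
```

Nothing in the two queries at the end is specific to cell biology, which is the point of the normalized layer.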

	### Architecture principle

	```
	Domain template
	  -> Scenario adapter (thin mapper, <50 lines per domain)
	  -> NormalizedScenarioPack
	  -> Observation mapper (fills ScientistObservation / LabManagerObservation)
	  -> Prompt assembler (data-driven, not hard-coded)
	  -> Validator (checks action against constraints)
	  -> Scorer (compares final protocol against hidden_reference_spec)
	```

	The external contract (`ScientistAction`, `LabManagerAction`,
	`ScientistObservation`, `LabManagerObservation`, `StepResult`) stays
	unchanged. The normalization lives below those models as an internal
	implementation layer.

	LLMs reason and negotiate. They never own truth. Truth lives in the
	normalized scenario pack and the deterministic scorer.

	### How this affects the future core logic

	| Current component | Impact | Severity |
	|---|---|---|
	| `replicalab/models.py` | External contract unchanged. Add `NormalizedScenarioPack` and helper models as new classes | Low |
	| `replicalab/scenarios/templates.py` (SCN 02) | Must define the normalized schema. `generate_scenario()` returns a pack instead of raw dicts | High |
	| `replicalab/scenarios/*.py` (SCN 03-05) | Each domain file becomes a thin scenario adapter that emits a normalized pack | Medium |
	| `replicalab/scenarios/templates.py` (SCN 06) | Difficulty scaling becomes mechanical: add/remove constraints, tighten resource limits | Medium, but simpler |
	| `replicalab/scenarios/templates.py` (SCN 07) | Constraint generator emits `Constraint` objects instead of ad hoc lab fields | High |
	| `replicalab/scenarios/templates.py` (SCN 08) | `hidden_reference_spec` is part of the pack, not a separate hidden structure | Medium |
	| `replicalab/utils/validation.py` (MOD 05-06) | Validators read `constraints[]` and `resources[]` from the pack instead of checking lab-specific fields | High |
	| `replicalab/scoring/*.py` (JDG 01-04) | Scorers compare the final protocol against `hidden_reference_spec` on normalized dimensions | High |
	| `replicalab/env/replicalab_env.py` (ENV 01-07) | `EpisodeState` gains a `scenario_pack` field. Reset populates it from the adapter | Medium |
	| `replicalab/agents/scientist_policy.py` (AGT 01-02) | Prompts assembled from scenario pack data, not hard-coded domain text | Medium |
	| `replicalab/agents/lab_manager_policy.py` (AGT 05-07) | Feasibility checker reads normalized constraints instead of lab-specific fields | Medium |
	| `frontend/` (UI 01+) | Render "constraint cards" and "resource cards" instead of lab-specific panels | Low (future) |

	### What stays the same

	- The turn protocol (`propose`, `revise`, `request_info`, `accept`)
	- The reward formula (`10 * rigor * feasibility * fidelity + bonuses - penalties`)
	- The external API contract (REST + WebSocket payloads)
	- The training loop and RL pipeline
	- The deterministic reward principle
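
For reference, the reward formula in the list above can be written as a small pure function. This is a minimal sketch: the function name, the default values, and the assumption that each of the three scores lies in `[0.0, 1.0]` are illustrative and not taken from the scoring code.

```python
def compute_reward(rigor: float, feasibility: float, fidelity: float,
                   bonuses: float = 0.0, penalties: float = 0.0) -> float:
    """Return 10 * rigor * feasibility * fidelity + bonuses - penalties."""
    # Assumption: each component score is normalized to [0.0, 1.0].
    for name, score in (("rigor", rigor), ("feasibility", feasibility), ("fidelity", fidelity)):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must be in [0.0, 1.0], got {score}")
    return 10 * rigor * feasibility * fidelity + bonuses - penalties


perfect = compute_reward(1.0, 1.0, 1.0)
infeasible = compute_reward(0.9, 0.0, 0.9)
mixed = compute_reward(0.5, 1.0, 0.8, bonuses=1.0, penalties=0.5)
```

Because the three scores are multiplied, a protocol that scores zero on any one dimension earns no base reward regardless of the other two. The function has no hidden state, so the same inputs always give the same reward.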

	---

	## 2. Planned work items for the normalized scenario layer

	### Item 1: Define the normalized scenario schema

	What: Add `NormalizedScenarioPack`, `Constraint`, `Resource`, and
	`Substitution` as Pydantic models in a new file
	`replicalab/scenarios/schema.py`.

	Why: This is the foundation. Every other item depends on having a stable
	schema that all adapters, validators, and scorers agree on.

	Depends on: Core MVP scenario work (SCN 02-09) being complete so we know
	what fields the adapters actually need.

	Scope: ~80 lines of model definitions, no business logic.

	---

	### Item 2: Convert existing scenario templates into adapters

	What: Refactor `cell_biology.py`, `ml_benchmark.py`, and
	`behavioral_psych.py` so each one returns a `NormalizedScenarioPack` instead
	of raw domain-specific dicts.

	Why: Proves the schema works for all three MVP domains. If a field cannot
	be cleanly mapped, the schema needs revision before adding new domains.

	Depends on: Item 1 (schema exists), SCN 03-05 (domain templates exist).

	Scope: ~50 lines per adapter. Adapters should be thin mappers; if one
	exceeds 50 lines, the schema is wrong.

	Constraint: The existing observation fields (`paper_title`,
	`equipment_available`, etc.) must still be populated. The adapter fills
	both the normalized pack and the legacy observation slots until the
	observation models are generalized.
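
One way an adapter could fill both targets is sketched below. Everything here is hypothetical: the raw template keys (`title`, `equipment`, `reagents`, `budget`, `staff`), the function name `adapt_cell_biology`, and the trimmed dataclasses that stand in for the Pydantic models. Only the legacy field names come from the current observation models.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Constraint:
    dimension: str
    label: str
    value: Any
    hard: bool = True


@dataclass
class Resource:
    category: str
    name: str
    available: bool


def adapt_cell_biology(template: dict) -> tuple[dict, dict]:
    """Map a raw template dict to (normalized pack fields, legacy observation fields)."""
    resources = [
        Resource(category="equipment", name=name, available=True)
        for name in template["equipment"]
    ] + [
        Resource(category="reagent", name=name, available=True)
        for name in template["reagents"]
    ]
    pack_fields = {
        "domain_id": "cell_biology",
        "task_summary": f"Replicate: {template['title']}",
        "constraints": [
            Constraint(dimension="budget", label="Budget", value=template["budget"]),
            Constraint(dimension="personnel", label="Staff", value=template["staff"]),
        ],
        "resources": resources,
    }
    # Legacy slots are derived from the normalized data so the two cannot drift apart.
    legacy_fields = {
        "paper_title": template["title"],
        "equipment_available": [r.name for r in resources if r.category == "equipment"],
        "reagents_in_stock": [r.name for r in resources if r.category == "reagent"],
        "staff_count": template["staff"],
    }
    return pack_fields, legacy_fields


raw_template = {
    "title": "Example viability study",
    "equipment": ["plate_reader"],
    "reagents": ["MTT"],
    "budget": 5000,
    "staff": 2,
}
pack_fields, legacy_fields = adapt_cell_biology(raw_template)
```

Deriving the legacy fields from the normalized resources, instead of from the raw template a second time, keeps a single source of truth during the transition.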

	---

	### Item 3: Build data-driven prompt assembly

	What: Replace hard-coded prompt text with a template that assembles from
	the scenario pack:

	```
	You are a {role} working on: {task_summary}

	Success criteria:
	{success_criteria[]}

	You must work within these constraints:
	{constraints[].label}: {constraints[].value}

	Available resources:
	{resources[].name} ({resources[].category}): {available/unavailable}
	```

	Why: Makes AGT 01 (Scientist prompt) and AGT 07 (Lab Manager templates)
	domain-neutral. Adding a new domain requires only a new adapter, not new
	prompts.

	Depends on: Item 2 (adapters produce normalized packs), AGT 01 and
	AGT 07 existing in their MVP form.

	Scope: One prompt template function per role. ~40 lines each.
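
A prompt assembler along these lines might look like the sketch below. The function name `assemble_prompt`, the bullet formatting, and the trimmed dataclasses are assumptions made for illustration; the section headings follow the template above.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Constraint:
    label: str
    value: Any


@dataclass
class Resource:
    name: str
    category: str
    available: bool


def assemble_prompt(role: str, task_summary: str, success_criteria: list[str],
                    constraints: list[Constraint], resources: list[Resource]) -> str:
    """Build a role prompt purely from scenario pack data."""
    lines = [f"You are a {role} working on: {task_summary}", "", "Success criteria:"]
    lines += [f"- {criterion}" for criterion in success_criteria]
    lines += ["", "You must work within these constraints:"]
    lines += [f"- {c.label}: {c.value}" for c in constraints]
    lines += ["", "Available resources:"]
    lines += [
        f"- {r.name} ({r.category}): {'available' if r.available else 'unavailable'}"
        for r in resources
    ]
    return "\n".join(lines)


prompt = assemble_prompt(
    role="Scientist",
    task_summary="Replicate a cell viability assay",
    success_criteria=["Include a vehicle control"],
    constraints=[Constraint(label="Budget", value=5000)],
    resources=[
        Resource(name="plate_reader", category="equipment", available=True),
        Resource(name="confocal", category="equipment", available=False),
    ],
)
```

The function contains no domain vocabulary of its own, so a new domain changes only the data passed in.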

	---

	### Item 4: Hybrid LLM Lab Manager with deterministic post-checking

	What: Replace the rule-based Lab Manager with a hybrid architecture:

	1. LLM receives the `LabManagerObservation` and generates negotiation text
	plus alternative suggestions in natural language
	2. Deterministic constraint checker computes the real feasibility flags by
	reading the normalized scenario pack's `constraints[]` and `resources[]`
	3. A composer merges the LLM output with the checker output into a valid
	`LabManagerAction`
	4. The `model_validator` on `LabManagerAction` catches any inconsistency

	Why: Gives the Lab Manager realistic negotiation language and creative
	suggestions (the LLM's strength) while keeping feasibility flags truthful
	(the checker's strength). Training reward stays deterministic because the
	reward engine only reads the validated action, not the LLM's raw text.

	Depends on: Item 2 (checker needs normalized constraints), AGT 05
	(feasibility checker exists), MOD 02 (LabManagerAction validators exist).

	Scope: ~120 lines covering the LLM call, the checker, and the composer.
	Uses the same base model as the Scientist (Qwen3-4B) with a separate role
	adapter.

	Risk: Episode variance increases because the same seed may produce
	different negotiation paths. Mitigate by keeping the deterministic checker as
	the authority on all boolean flags. The LLM only controls `explanation` text
	and suggestion ideas, never the truth flags.
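
The division of authority described above can be sketched as follows. This is an assumption-heavy illustration, not the planned implementation: `LabManagerActionSketch` is a hypothetical stand-in for `LabManagerAction` (only `explanation` is a name from this document), the LLM call is replaced by a hard-coded dict, and the checker handles only numeric upper limits.

```python
from dataclasses import dataclass


@dataclass
class LabManagerActionSketch:
    """Hypothetical stand-in for LabManagerAction."""
    feasible: bool
    violated: list[str]
    explanation: str


def check_feasibility(protocol: dict, constraints: list[dict]) -> list[str]:
    """Deterministic checker: labels of hard numeric limits the protocol exceeds."""
    violated = []
    for c in constraints:
        if c["hard"] and protocol.get(c["dimension"], 0) > c["value"]:
            violated.append(c["label"])
    return violated


def compose(llm_output: dict, violated: list[str]) -> LabManagerActionSketch:
    """Merge step: text comes from the LLM, truth flags come only from the checker."""
    return LabManagerActionSketch(
        feasible=not violated,  # any flag the LLM claimed is ignored
        violated=violated,
        explanation=llm_output.get("explanation", ""),
    )


constraints = [
    {"dimension": "budget", "label": "Budget", "value": 5000, "hard": True},
    {"dimension": "time", "label": "Deadline (days)", "value": 14, "hard": False},
]
protocol = {"budget": 7000, "time": 20}

# Stubbed LLM output that wrongly claims the protocol is feasible.
llm_output = {"explanation": "This looks fine to me.", "feasible": True}

action = compose(llm_output, check_feasibility(protocol, constraints))
```

Here the stubbed LLM asserts feasibility, but the composed action still reports the budget violation, because `compose` never reads the LLM's `feasible` key.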

	---

	### Item 5: Normalized scoring against hidden reference spec

	What: Refactor the scoring engine so `score_rigor()`,
	`score_feasibility()`, and `score_fidelity()` compare the final protocol
	against `hidden_reference_spec` from the normalized scenario pack instead of
	using domain-specific scoring logic.

	Scoring dimensions become:

	- Rigor: Does the protocol preserve the success criteria? Compare
	`protocol.controls` against `hidden_reference_spec.required_controls`,
	check sample size ratio, verify statistical validity markers.
	- Feasibility: Does the protocol satisfy all hard constraints? Walk
	`constraints[]` and check each one against the protocol.
	- Fidelity: How close is the protocol to the reference spec? Compare
	technique, duration, equipment, reagents against
	`hidden_reference_spec` and compute a similarity score using
	`allowed_substitutions[]` quality impact.

	Why: Makes scoring work for any domain without per-domain scorer code.
	The domain-specific knowledge lives in the scenario adapter (which defines
	what the reference spec and constraints are), not in the scoring engine.

	Depends on: Item 1 (schema with `hidden_reference_spec`), Item 2
	(adapters populate it), JDG 01-04 (MVP scorers exist to refactor from).

	Scope: Refactor of existing scorer files. ~150 lines total across
	`rigor.py`, `feasibility.py`, `fidelity.py`.
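
As one possible reading of the fidelity dimension, the sketch below averages a per-field match score and discounts allowed substitutions by their quality impact. The averaging rule, the zero score for unlisted deviations, and the restriction to string-valued fields are all assumptions; the real `score_fidelity()` may weight fields differently.

```python
def score_fidelity(protocol: dict, reference: dict, substitutions: list[dict]) -> float:
    """Average per-field fidelity in [0.0, 1.0] for string-valued reference fields."""
    if not reference:
        raise ValueError("reference spec is empty")
    # (original, replacement) -> quality_impact
    impact = {(s["original"], s["replacement"]): s["quality_impact"] for s in substitutions}
    total = 0.0
    for key, expected in reference.items():
        actual = protocol.get(key)
        if actual == expected:
            total += 1.0  # exact match with the reference spec
        elif (expected, actual) in impact:
            total += 1.0 - impact[(expected, actual)]  # allowed swap, discounted
        # Any other deviation contributes 0 for this field.
    return total / len(reference)


reference = {
    "technique": "MTT assay",
    "equipment": "confocal_microscope",
    "duration": "14 days",
}
protocol = {
    "technique": "MTT assay",             # matches
    "equipment": "widefield_microscope",  # allowed substitution
    "duration": "10 days",                # unlisted deviation
}
substitutions = [
    {"original": "confocal_microscope", "replacement": "widefield_microscope",
     "quality_impact": 0.2},
]

fidelity = score_fidelity(protocol, reference, substitutions)  # roughly 0.6
```

The scorer never mentions microscopes or assays by name. All domain knowledge arrives through `reference` and `substitutions`, which the scenario adapter supplies.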

	---

	### Item 6: Lab Manager orchestrator with specialist subagents

	What: Decompose the hybrid Lab Manager into a coordinator that delegates
	to specialist subagents:

	| Subagent | Responsibility |
	|---|---|
	| Budget agent | Checks cost against remaining budget |
	| Scheduling agent | Checks timeline and booking conflicts |
	| Equipment agent | Checks equipment availability and substitutions |
	| Safety agent | Checks policy and compliance constraints |
	| Coordinator | Aggregates subagent outputs into one `LabManagerAction` |

	Externally, the contract is unchanged: one `LabManagerAction` per turn. The
	orchestration is internal.

	Why: Strengthens the multi-agent story for hackathon track alignment.
	Demonstrates that the Lab Manager is not a monolithic policy but a team of
	constraint specialists. Each subagent can be individually tested, improved,
	or replaced.

	Depends on: Item 4 (hybrid Lab Manager works first), Item 2 (normalized
	constraints are available for each subagent to read).

	Scope: Orchestration layer ~200 lines. Each subagent ~40 lines. Total
	~400 lines.

	Risk: Adds latency (multiple LLM calls or multiple checker passes per
	turn), orchestration failure handling, and logging complexity. Only pursue
	after the single hybrid Lab Manager is stable and training is producing
	results.

	Phasing: This is the lowest priority item. Build it only if the MVP is
	complete, training shows improvement, and there is time remaining before
	submission.
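
A minimal sketch of the coordinator pattern is shown below, using purely deterministic subagents with no LLM calls. The subagent signatures, the protocol and limit keys, and the shape of the aggregated result are hypothetical; only two of the specialists are shown.

```python
from typing import Callable

# Each subagent returns (ok, note); note is empty when the check passes.
Check = Callable[[dict, dict], tuple[bool, str]]


def budget_agent(protocol: dict, limits: dict) -> tuple[bool, str]:
    ok = protocol["cost"] <= limits["budget"]
    return ok, ("" if ok else "cost exceeds remaining budget")


def scheduling_agent(protocol: dict, limits: dict) -> tuple[bool, str]:
    ok = protocol["days"] <= limits["days"]
    return ok, ("" if ok else "timeline exceeds available days")


def coordinate(protocol: dict, limits: dict, subagents: dict[str, Check]) -> dict:
    """Aggregate all subagent verdicts into a single result for one turn."""
    results = {name: agent(protocol, limits) for name, agent in subagents.items()}
    return {
        "feasible": all(ok for ok, _ in results.values()),
        "flags": {name: ok for name, (ok, _) in results.items()},
        "notes": [note for ok, note in results.values() if not ok],
    }


subagents = {"budget": budget_agent, "scheduling": scheduling_agent}
result = coordinate({"cost": 6000, "days": 10}, {"budget": 5000, "days": 14}, subagents)
```

Because each subagent is a plain function with the same signature, it can be unit-tested or swapped out without touching the coordinator.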

	---

	## 3. Recommended order

	| Order | Item | Gate |
	|---|---|---|
	| 1 | Define normalized scenario schema | After SCN 02-09 complete |
	| 2 | Convert templates to adapters | After Item 1 |
	| 3 | Data-driven prompt assembly | After Item 2 + AGT 01/07 |
	| 4 | Hybrid LLM Lab Manager | After Item 2 + AGT 05 |
	| 5 | Normalized scoring | After Item 2 + JDG 01-04 |
	| 6 | Lab Manager orchestrator with subagents | After Item 4 stable |
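
The item-to-item gates can also be checked mechanically. The sketch below encodes the "Depends on" lines from section 2 as a graph and asks the standard library's `graphlib` for a valid build order. The item keys are invented labels, and gates on MVP tasks (SCN, AGT, JDG, MOD) are left out.

```python
from graphlib import TopologicalSorter

# item -> set of items that must be finished first
deps = {
    "item_1_schema": set(),
    "item_2_adapters": {"item_1_schema"},
    "item_3_prompts": {"item_2_adapters"},
    "item_4_hybrid_lab_manager": {"item_2_adapters"},
    "item_5_scoring": {"item_1_schema", "item_2_adapters"},
    "item_6_orchestrator": {"item_4_hybrid_lab_manager", "item_2_adapters"},
}

# static_order() raises graphlib.CycleError if the gates contradict each other.
order = list(TopologicalSorter(deps).static_order())
position = {item: index for index, item in enumerate(order)}
```

Items 3, 4, and 5 depend only on earlier items and not on each other, so any ordering among them is valid. The table lists them in the suggested priority.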

	---

	## 4. Key principle

	The external contract stays stable. Internal policy can evolve. LLMs reason
	and negotiate. They never own truth. Truth lives in the normalized scenario
	pack and the deterministic scorer.