# Future Improvements

Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`

This document tracks post-MVP architectural improvements. Work here begins
only after the core logic is complete and the hackathon deliverables are
stable.

---

## 1. Domain-Agnostic Normalized Scenario Layer

### Priority: highest future feature

### Problem

The current models in `replicalab/models.py` use domain-biased field names:

- `paper_title`, `paper_hypothesis`, `paper_method`, `paper_key_finding`
- `equipment_available`, `reagents_in_stock`, `staff_count`
- `sample_size`, `controls`, `technique`

These work for the three MVP scenario families (cell biology, ML benchmark,
behavioral psychology) because all three map onto a lab-style replication
frame. But if the environment needs to support domains outside scientific
replication (e.g., engineering design, clinical trial planning, supply chain
optimization), the field names stop making sense.

The turn protocol itself (`propose`, `revise`, `request_info`, `accept`) is
already generic. The gap is in the observation and protocol content layer.

### Solution: normalized scenario representation

Introduce a structured internal representation that any domain adapter can
emit:

```python
from typing import Any

from pydantic import BaseModel


# Constraint, Resource, and Substitution are defined in the next block;
# in a real module they must be declared before this class.
class NormalizedScenarioPack(BaseModel):
    domain_id: str                             # "cell_biology", "ml_benchmark", etc.
    task_summary: str                          # what the agent is trying to achieve
    success_criteria: list[str]                # measurable conditions for success
    constraints: list[Constraint]              # budget, time, equipment, policy, etc.
    resources: list[Resource]                  # what is available to work with
    allowed_substitutions: list[Substitution]  # valid swaps the agent can propose
    hidden_reference_spec: dict                # ground truth the judge scores against
    difficulty: str                            # "easy", "medium", "hard"
    metadata: dict                             # domain-specific extras
```

Where:

```python
from typing import Any, Optional

from pydantic import BaseModel


class Constraint(BaseModel):
    dimension: str    # "budget", "time", "equipment", "personnel", "safety"
    label: str        # human-readable name
    value: Any        # the constraint value (numeric, list, etc.)
    hard: bool = True # hard constraint vs soft preference


class Resource(BaseModel):
    category: str                   # "equipment", "reagent", "compute", "personnel"
    name: str                       # resource identifier
    available: bool                 # currently available
    quantity: Optional[int] = None  # count if applicable; None when not countable
    notes: str = ""                 # booking conflicts, expiry, etc.


class Substitution(BaseModel):
    original: str          # what the reference spec uses
    replacement: str       # what the agent can use instead
    quality_impact: float  # 0.0 to 1.0, how much fidelity is lost
    cost_delta: float      # cost difference
```

### Architecture principle

```
Domain template
  -> Scenario adapter (thin mapper, <50 lines per domain)
  -> NormalizedScenarioPack
  -> Observation mapper (fills ScientistObservation / LabManagerObservation)
  -> Prompt assembler (data-driven, not hard-coded)
  -> Validator (checks action against constraints)
  -> Scorer (compares final protocol against hidden_reference_spec)
```

The external contract (`ScientistAction`, `LabManagerAction`,
`ScientistObservation`, `LabManagerObservation`, `StepResult`) stays
unchanged. The normalization lives below those models as an internal
implementation layer.

LLMs reason and negotiate. They never own truth. Truth lives in the
normalized scenario pack and the deterministic scorer.
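
As a rough illustration of the observation mapper step, the sketch below fills the legacy lab-style slots from normalized resources. It uses plain dataclasses as stand-ins for the Pydantic models so it runs without dependencies. The function name `map_lab_manager_observation`, the reduced `ScenarioPack`, and the rule that `staff_count` counts available personnel resources are all assumptions made for illustration, not part of the existing codebase.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    # Stand-in for the Pydantic Resource model (only the fields used here).
    category: str
    name: str
    available: bool


@dataclass
class ScenarioPack:
    # Reduced, hypothetical stand-in for NormalizedScenarioPack.
    task_summary: str
    resources: list[Resource] = field(default_factory=list)


def map_lab_manager_observation(pack: ScenarioPack) -> dict:
    """Fill legacy lab-style observation slots from normalized resources."""

    def names(category: str) -> list[str]:
        # Only resources that are currently available are surfaced.
        return [r.name for r in pack.resources
                if r.category == category and r.available]

    return {
        "equipment_available": names("equipment"),
        "reagents_in_stock": names("reagent"),
        # Assumption: staff_count is the number of available personnel resources.
        "staff_count": len(names("personnel")),
    }


pack = ScenarioPack(
    task_summary="Replicate a cell viability assay",
    resources=[
        Resource("equipment", "plate_reader", True),
        Resource("equipment", "confocal_microscope", False),
        Resource("reagent", "MTT", True),
        Resource("personnel", "technician_a", True),
    ],
)
observation = map_lab_manager_observation(pack)
```

Because the mapper is a pure function of the pack, the external observation models can stay as they are while the source of their values changes underneath.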

### How this affects the future core logic

| Current component | Impact | Severity |
|---|---|---|
| `replicalab/models.py` | External contract unchanged. Add `NormalizedScenarioPack` and helper models as new classes | Low |
| `replicalab/scenarios/templates.py` (SCN 02) | Must define the normalized schema. `generate_scenario()` returns a pack instead of raw dicts | High |
| `replicalab/scenarios/*.py` (SCN 03-05) | Each domain file becomes a thin scenario adapter that emits a normalized pack | Medium |
| `replicalab/scenarios/templates.py` (SCN 06) | Difficulty scaling becomes mechanical: add/remove constraints, tighten resource limits | Medium, but simpler |
| `replicalab/scenarios/templates.py` (SCN 07) | Constraint generator emits `Constraint` objects instead of ad hoc lab fields | High |
| `replicalab/scenarios/templates.py` (SCN 08) | `hidden_reference_spec` is part of the pack, not a separate hidden structure | Medium |
| `replicalab/utils/validation.py` (MOD 05-06) | Validators read `constraints[]` and `resources[]` from the pack instead of checking lab-specific fields | High |
| `replicalab/scoring/*.py` (JDG 01-04) | Scorers compare the final protocol against `hidden_reference_spec` on normalized dimensions | High |
| `replicalab/env/replicalab_env.py` (ENV 01-07) | `EpisodeState` gains a `scenario_pack` field. Reset populates it from the adapter | Medium |
| `replicalab/agents/scientist_policy.py` (AGT 01-02) | Prompts assembled from scenario pack data, not hard-coded domain text | Medium |
| `replicalab/agents/lab_manager_policy.py` (AGT 05-07) | Feasibility checker reads normalized constraints instead of lab-specific fields | Medium |
| `frontend/` (UI 01+) | Render "constraint cards" and "resource cards" instead of lab-specific panels | Low (future) |

### What stays the same

- The turn protocol (`propose`, `revise`, `request_info`, `accept`)
- The reward formula (`10 * rigor * feasibility * fidelity + bonuses - penalties`)
- The external API contract (REST + WebSocket payloads)
- The training loop and RL pipeline
- The deterministic reward principle
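
The reward formula in the list above can be written out as a small function. This is a minimal sketch: the function name `compute_reward` is hypothetical, and the assumption that each of the three scores lies in `[0.0, 1.0]` is not stated in this document.

```python
def compute_reward(rigor: float, feasibility: float, fidelity: float,
                   bonuses: float = 0.0, penalties: float = 0.0) -> float:
    """Return 10 * rigor * feasibility * fidelity + bonuses - penalties."""
    # Assumption: each component score is normalized to [0.0, 1.0].
    for name, value in (("rigor", rigor),
                        ("feasibility", feasibility),
                        ("fidelity", fidelity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")
    return 10 * rigor * feasibility * fidelity + bonuses - penalties


example_reward = compute_reward(0.8, 1.0, 0.5, bonuses=0.5, penalties=0.25)
infeasible_reward = compute_reward(0.9, 0.0, 0.9)
```

Because the three scores are multiplied, a zero on any one dimension removes the whole base term, so an infeasible protocol cannot be rescued by high rigor or fidelity.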

---

## 2. Planned work items for the normalized scenario layer

### Item 1: Define the normalized scenario schema

**What:** Add `NormalizedScenarioPack`, `Constraint`, `Resource`, and
`Substitution` as Pydantic models in a new file
`replicalab/scenarios/schema.py`.

**Why:** This is the foundation. Every other item depends on having a stable
schema that all adapters, validators, and scorers agree on.

**Depends on:** Core MVP scenario work (SCN 02-09) being complete so we know
what fields the adapters actually need.

**Scope:** ~80 lines of model definitions, no business logic.

---

### Item 2: Convert existing scenario templates into adapters

**What:** Refactor `cell_biology.py`, `ml_benchmark.py`, and
`behavioral_psych.py` so each one returns a `NormalizedScenarioPack` instead
of raw domain-specific dicts.

**Why:** Proves the schema works for all three MVP domains. If a field cannot
be cleanly mapped, the schema needs revision before adding new domains.

**Depends on:** Item 1 (schema exists), SCN 03-05 (domain templates exist).

**Scope:** ~50 lines per adapter. Should be thin mappers. If an adapter
exceeds 50 lines, the schema is wrong.

**Constraint:** The existing observation fields (`paper_title`,
`equipment_available`, etc.) must still be populated. The adapter fills
both the normalized pack and the legacy observation slots until the
observation models are generalized.
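
A thin adapter along these lines might look like the sketch below. It uses dataclasses as stand-ins for the Pydantic models. The function name `adapt_cell_biology`, the template's `budget` key, and the constraint label are hypothetical; the real template fields will only be known once SCN 03-05 are complete.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Constraint:
    dimension: str
    label: str
    value: Any
    hard: bool = True


@dataclass
class Resource:
    category: str
    name: str
    available: bool


@dataclass
class ScenarioPack:
    # Reduced stand-in for NormalizedScenarioPack.
    domain_id: str
    task_summary: str
    constraints: list[Constraint] = field(default_factory=list)
    resources: list[Resource] = field(default_factory=list)


LEGACY_FIELDS = ("paper_title", "equipment_available", "reagents_in_stock")


def adapt_cell_biology(template: dict) -> tuple[ScenarioPack, dict]:
    """Map a raw cell-biology template to a pack plus legacy observation slots."""
    pack = ScenarioPack(
        domain_id="cell_biology",
        task_summary=f"Replicate: {template['paper_title']}",
        # Assumption: the raw template exposes a numeric "budget" key.
        constraints=[Constraint("budget", "Total budget", template["budget"])],
        resources=(
            [Resource("equipment", n, True) for n in template["equipment_available"]]
            + [Resource("reagent", n, True) for n in template["reagents_in_stock"]]
        ),
    )
    # Legacy slots are copied through unchanged until observations are generalized.
    legacy = {key: template[key] for key in LEGACY_FIELDS}
    return pack, legacy


template = {
    "paper_title": "Drug X reduces cell viability",
    "equipment_available": ["plate_reader"],
    "reagents_in_stock": ["MTT"],
    "budget": 5000,
}
pack, legacy = adapt_cell_biology(template)
```

Returning both structures from one call keeps the dual-population requirement in a single place, which makes it easy to delete the legacy half later.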

---

### Item 3: Build data-driven prompt assembly

**What:** Replace hard-coded prompt text with a template that assembles from
the scenario pack:

```
You are a {role} working on: {task_summary}
Success criteria:
{success_criteria[]}
You must work within these constraints:
{constraints[].label}: {constraints[].value}
Available resources:
{resources[].name} ({resources[].category}): {available/unavailable}
```

**Why:** Makes AGT 01 (Scientist prompt) and AGT 07 (Lab Manager templates)
domain-neutral. Adding a new domain requires only a new adapter, not new
prompts.

**Depends on:** Item 2 (adapters produce normalized packs), AGT 01 and
AGT 07 existing in their MVP form.

**Scope:** One prompt template function per role. ~40 lines each.
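
One possible shape for such a template function is sketched below. The pack is passed as a plain dict for brevity, and the function name `assemble_prompt` and the bullet formatting are assumptions rather than the planned implementation.

```python
def assemble_prompt(role: str, pack: dict) -> str:
    """Build a role prompt from scenario pack data, with no domain-specific text."""
    lines = [
        f"You are a {role} working on: {pack['task_summary']}",
        "Success criteria:",
    ]
    lines += [f"- {criterion}" for criterion in pack["success_criteria"]]
    lines.append("You must work within these constraints:")
    lines += [f"- {c['label']}: {c['value']}" for c in pack["constraints"]]
    lines.append("Available resources:")
    for r in pack["resources"]:
        status = "available" if r["available"] else "unavailable"
        lines.append(f"- {r['name']} ({r['category']}): {status}")
    return "\n".join(lines)


pack = {
    "task_summary": "Reproduce a reported benchmark accuracy",
    "success_criteria": ["Accuracy within 1 point of the reported value"],
    "constraints": [{"label": "GPU hours", "value": 8}],
    "resources": [
        {"name": "A100", "category": "compute", "available": True},
        {"name": "H100", "category": "compute", "available": False},
    ],
}
prompt = assemble_prompt("Scientist", pack)
print(prompt)
```

The same function serves both roles; only the `role` string and the pack contents change between domains.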

---

### Item 4: Hybrid LLM Lab Manager with deterministic post-checking

**What:** Replace the rule-based Lab Manager with a hybrid architecture:

1. LLM receives the `LabManagerObservation` and generates negotiation text
   plus alternative suggestions in natural language
2. Deterministic constraint checker computes the real feasibility flags by
   reading the normalized scenario pack's `constraints[]` and `resources[]`
3. A composer merges the LLM output with the checker output into a valid
   `LabManagerAction`
4. The `model_validator` on `LabManagerAction` catches any inconsistency

**Why:** Gives the Lab Manager realistic negotiation language and creative
suggestions (the LLM's strength) while keeping feasibility flags truthful
(the checker's strength). Training reward stays deterministic because the
reward engine only reads the validated action, not the LLM's raw text.

**Depends on:** Item 2 (checker needs normalized constraints), AGT 05
(feasibility checker exists), MOD 02 (LabManagerAction validators exist).

**Scope:** ~120 lines. The LLM call, the checker, the composer. Uses the
same base model as the Scientist (Qwen3-4B) with a separate role adapter.

**Risk:** Episode variance increases because the same seed may produce
different negotiation paths. Mitigate by keeping the deterministic checker as
the authority on all boolean flags. The LLM only controls `explanation` text
and suggestion ideas, never the truth flags.
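
The division of authority between the LLM and the checker can be sketched as follows. The LLM call is replaced by a fixed string, `LabManagerActionSketch` is a hypothetical stand-in (the real `LabManagerAction` fields are not listed in this document), and constraints are assumed to be numeric upper bounds keyed by `dimension`.

```python
from dataclasses import dataclass


@dataclass
class LabManagerActionSketch:
    # Hypothetical stand-in; not the real LabManagerAction schema.
    feasible: bool
    violated: list[str]
    explanation: str


def check_feasibility(protocol: dict, constraints: list[dict]) -> list[str]:
    """Deterministic checker: labels of hard constraints the protocol exceeds."""
    violated = []
    for c in constraints:
        # Assumption: each constraint value is a numeric upper bound.
        if c["hard"] and protocol.get(c["dimension"], 0) > c["value"]:
            violated.append(c["label"])
    return violated


def compose_action(llm_text: str, protocol: dict,
                   constraints: list[dict]) -> LabManagerActionSketch:
    """Merge LLM text with checker output; the checker owns every truth flag."""
    violated = check_feasibility(protocol, constraints)
    return LabManagerActionSketch(
        feasible=not violated,
        violated=violated,
        explanation=llm_text,  # the LLM only contributes free text
    )


constraints = [
    {"dimension": "budget", "label": "Total budget", "value": 1000, "hard": True},
    {"dimension": "time", "label": "Preferred duration (days)", "value": 5, "hard": False},
]
protocol = {"budget": 1200, "time": 7}
llm_text = "This plan looks fine to me."  # stubbed LLM output, deliberately wrong
action = compose_action(llm_text, protocol, constraints)
```

Even though the stubbed LLM text claims the plan is fine, the composed action reports it as infeasible, which is the property that keeps the training reward deterministic.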

---

### Item 5: Normalized scoring against hidden reference spec

**What:** Refactor the scoring engine so `score_rigor()`,
`score_feasibility()`, and `score_fidelity()` compare the final protocol
against `hidden_reference_spec` from the normalized scenario pack instead of
using domain-specific scoring logic.

Scoring dimensions become:

- **Rigor:** Does the protocol preserve the success criteria? Compare
  `protocol.controls` against `hidden_reference_spec.required_controls`,
  check sample size ratio, verify statistical validity markers.
- **Feasibility:** Does the protocol satisfy all hard constraints? Walk
  `constraints[]` and check each one against the protocol.
- **Fidelity:** How close is the protocol to the reference spec? Compare
  technique, duration, equipment, reagents against
  `hidden_reference_spec` and compute a similarity score using
  `allowed_substitutions[]` quality impact.

**Why:** Makes scoring work for any domain without per-domain scorer code.
The domain-specific knowledge lives in the scenario adapter (which defines
what the reference spec and constraints are), not in the scoring engine.

**Depends on:** Item 1 (schema with `hidden_reference_spec`), Item 2
(adapters populate it), JDG 01-04 (MVP scorers exist to refactor from).

**Scope:** Refactor of existing scorer files. ~150 lines total across
`rigor.py`, `feasibility.py`, `fidelity.py`.
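
To make the feasibility and fidelity dimensions concrete, here is one way they could be computed. The specific scoring rules are assumptions chosen for illustration: feasibility is taken as the fraction of hard constraints satisfied (treating values as numeric upper bounds), and fidelity averages a per-item score of 1.0 for an exact match, `1 - quality_impact` for an allowed substitution, and 0.0 otherwise. The function signatures are simplified and do not reflect the existing scorers.

```python
def score_feasibility(protocol: dict, constraints: list[dict]) -> float:
    """Fraction of hard constraints the protocol satisfies (1.0 if there are none)."""
    hard = [c for c in constraints if c["hard"]]
    if not hard:
        return 1.0
    satisfied = sum(1 for c in hard
                    if protocol.get(c["dimension"], 0) <= c["value"])
    return satisfied / len(hard)


def score_fidelity(protocol_items: list[str], reference_items: list[str],
                   substitutions: list[dict]) -> float:
    """Average per-item closeness to the reference spec."""
    if not reference_items:
        return 1.0
    used = set(protocol_items)
    total = 0.0
    for ref in reference_items:
        if ref in used:
            total += 1.0  # exact match with the reference
            continue
        # Best allowed substitution for this item that the protocol actually uses.
        total += max(
            (1.0 - s["quality_impact"] for s in substitutions
             if s["original"] == ref and s["replacement"] in used),
            default=0.0,
        )
    return total / len(reference_items)


constraints = [
    {"dimension": "budget", "value": 1000, "hard": True},
    {"dimension": "days", "value": 7, "hard": True},
]
feasibility = score_feasibility({"budget": 900, "days": 10}, constraints)

substitutions = [{"original": "confocal_microscope",
                  "replacement": "widefield_microscope",
                  "quality_impact": 0.25}]
fidelity = score_fidelity(["widefield_microscope", "MTT"],
                          ["confocal_microscope", "MTT"],
                          substitutions)
```

Neither function mentions a domain: everything domain-specific arrives through the constraints, the reference items, and the substitution table.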

---

### Item 6: Lab Manager orchestrator with specialist subagents

**What:** Decompose the hybrid Lab Manager into a coordinator that delegates
to specialist subagents:

| Subagent | Responsibility |
|---|---|
| Budget agent | Checks cost against remaining budget |
| Scheduling agent | Checks timeline and booking conflicts |
| Equipment agent | Checks equipment availability and substitutions |
| Safety agent | Checks policy and compliance constraints |
| Coordinator | Aggregates subagent outputs into one `LabManagerAction` |

Externally, the contract is unchanged: one `LabManagerAction` per turn. The
orchestration is internal.

**Why:** Stronger multi-agent story for the hackathon track alignment.
Demonstrates that the Lab Manager is not a monolithic policy but a team of
constraint specialists. Each subagent can be individually tested, improved,
or replaced.

**Depends on:** Item 4 (hybrid Lab Manager works first), Item 2 (normalized
constraints are available for each subagent to read).

**Scope:** Orchestration layer ~200 lines. Each subagent ~40 lines. Total
~400 lines.

**Risk:** Adds latency (multiple LLM calls or multiple checker passes per
turn), orchestration failure handling, and logging complexity. Only pursue
after the single hybrid Lab Manager is stable and training is producing
results.

**Phasing:** This is the lowest priority item. Build it only if the MVP is
complete, training shows improvement, and there is time remaining before
submission.
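
A minimal sketch of the coordinator pattern follows, with each subagent reduced to a deterministic function that returns an issue string or `None`. The function names, the `protocol` and `limits` dict keys, and the shape of the aggregated result are all hypothetical; a real coordinator would emit a `LabManagerAction`.

```python
from typing import Callable, Optional

Subagent = Callable[[dict, dict], Optional[str]]


def budget_agent(protocol: dict, limits: dict) -> Optional[str]:
    if protocol["cost"] > limits["budget"]:
        return "cost exceeds remaining budget"
    return None


def scheduling_agent(protocol: dict, limits: dict) -> Optional[str]:
    if protocol["days"] > limits["days"]:
        return "timeline exceeds available days"
    return None


def equipment_agent(protocol: dict, limits: dict) -> Optional[str]:
    missing = sorted(set(protocol["equipment"]) - set(limits["equipment"]))
    if missing:
        return f"unavailable equipment: {', '.join(missing)}"
    return None


def safety_agent(protocol: dict, limits: dict) -> Optional[str]:
    if protocol.get("hazardous") and not limits.get("hazard_approved"):
        return "hazardous work is not approved"
    return None


SUBAGENTS: dict[str, Subagent] = {
    "budget": budget_agent,
    "scheduling": scheduling_agent,
    "equipment": equipment_agent,
    "safety": safety_agent,
}


def coordinate(protocol: dict, limits: dict) -> dict:
    """Run every subagent and aggregate their findings into one result."""
    issues = {}
    for name, agent in SUBAGENTS.items():
        issue = agent(protocol, limits)
        if issue is not None:
            issues[name] = issue
    # One aggregated result per turn, mirroring the one-action contract.
    return {"feasible": not issues, "issues": issues}


protocol = {"cost": 1200, "days": 5, "equipment": ["plate_reader", "sequencer"]}
limits = {"budget": 1000, "days": 7, "equipment": ["plate_reader"]}
result = coordinate(protocol, limits)
```

Keeping each subagent a standalone function is what allows it to be tested or replaced individually, as the rationale above describes.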

---

## 3. Recommended order

| Order | Item | Gate |
|---|---|---|
| 1 | Define normalized scenario schema | After SCN 02-09 complete |
| 2 | Convert templates to adapters | After Item 1 |
| 3 | Data-driven prompt assembly | After Item 2 + AGT 01/07 |
| 4 | Hybrid LLM Lab Manager | After Item 2 + AGT 05 |
| 5 | Normalized scoring | After Item 2 + JDG 01-04 |
| 6 | Lab Manager orchestrator with subagents | After Item 4 stable |

---

## 4. Key principle

The external contract stays stable. Internal policy can evolve. LLMs reason
and negotiate. They never own truth. Truth lives in the normalized scenario
pack and the deterministic scorer.