replicalab / ReplicaLab_Blueprint.md

maxxie114

Initial HF Spaces deployment

80d8c84 1 day ago

ReplicaLab

A multi-agent scientific replication environment built on OpenEnv

Overview

ReplicaLab is a virtual scientific replication world. Each episode generates an original experiment and a constrained lab, then two agents negotiate a replication plan:

A Scientist agent that protects scientific validity.
A Lab Manager agent that protects cost, equipment, time, staffing, and feasibility.

They negotiate over multiple rounds. If they converge on a sound, feasible protocol, the episode yields a high reward. If they fail, overspend, or strip away critical scientific elements, the reward stays low.

The real-world motivation is the replication crisis: published protocols describe ideal conditions, but real labs face missing tools, tight budgets, booking conflicts, reagent shortages, and limited personnel. ReplicaLab trains an agent to answer a single question:

How do we adapt an experiment without breaking the science?

Hackathon Track Alignment

ReplicaLab touches four of the five OpenEnv Hackathon problem statements.

Primary Tracks

Track	Fit
Multi-Agent Interactions	Two roles hold different private information and must negotiate toward consensus. Strongest fit.
World Modeling (Professional)	The agent reasons inside a professional world with hidden constraints. Very strong fit.

Supporting Tracks

Track	Fit
Long-Horizon Planning	The agent must ask, revise, recover, and converge across multiple rounds rather than solving in one step.
Self-Improvement	The same environment trains the Scientist so its behavior improves over repeated episodes.

Demo framing: Lead with Multi-Agent + World Modeling. Support with Long-Horizon + Self-Improvement.

Why This Is an Environment

ReplicaLab is not a prompt. It satisfies all five properties of a proper environment:

State — Current paper, lab constraints, round number, negotiation history, proposed protocol, spent budget, remaining stock, done flag.
Actions — The Scientist can propose, revise, ask questions, or accept. The Lab Manager can report feasibility, suggest substitutions, reject, or accept.
Transitions — Each action mutates the world: budget consumed, protocol updated, round counter incremented, dialogue history extended.
Observations — Each role sees a different partial view of the world (partially observable).
Reward — The environment scores the quality of the final plan.

OpenEnv provides exactly this pattern: typed Action, Observation, and State models with reset(), step(), and state() methods, wrapped in FastAPI + WebSocket serving with per-session instances.
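
The split between full state and per-role observations can be sketched with plain dataclasses. This is a minimal illustration only: it does not use OpenEnv's base classes, and `LabState`, `scientist_view`, `lab_manager_view`, and every field name are hypothetical. The choice to hide the paper from the Lab Manager is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LabState:
    """Full world state. Field names are illustrative, not the project's schema."""
    paper_summary: str
    budget_total: float
    budget_spent: float = 0.0
    round_number: int = 0
    stock: Dict[str, int] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)
    protocol: Optional[dict] = None
    done: bool = False


def scientist_view(state: LabState) -> dict:
    """The Scientist sees the paper and the dialogue, but not the lab's resources."""
    return {
        "paper_summary": state.paper_summary,
        "history": list(state.history),
        "protocol": state.protocol,
    }


def lab_manager_view(state: LabState) -> dict:
    """In this sketch the Lab Manager sees resources and the proposal, not the paper."""
    return {
        "budget_remaining": state.budget_total - state.budget_spent,
        "stock": dict(state.stock),
        "protocol": state.protocol,
    }


state = LabState(
    paper_summary="Drug A reduces cell viability",
    budget_total=1000.0,
    stock={"WST1_kit": 2},
)
```

Both views are derived from the same `LabState`, so a transition only has to update one object and each role's observation stays consistent with it.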

Episode Lifecycle

A single episode unfolds as follows:

Reset — reset(seed=42) creates one paper template, one lab constraint set, and one hidden evaluation rubric.
Scientist observes — Paper summary, experiment goal, conversation history, current proposed protocol.
Lab Manager observes — Budget, equipment, booking calendar, reagents, staff, safety rules, current proposal.
Scientist acts — Proposes, revises, asks, or accepts.
Lab Manager responds — Reports feasibility, suggests substitutions, rejects, or accepts.
State updates — Environment transitions.
Repeat until both sides accept or the fixed round limit is reached (timeout).
Reward returned — The environment scores the final protocol.
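
The lifecycle above can be sketched as a plain loop. This is a toy under stated assumptions, not the project's environment: `run_episode`, `lab_manager_reply`, `cutting_scientist`, the round limit, and the budget values are all hypothetical, and the final scoring step is left out.

```python
import random

MAX_ROUNDS = 4  # assumed round limit for this sketch


def lab_manager_reply(proposal, budget):
    """Stand-in rule-based Lab Manager: accept only if the plan fits the budget."""
    return "accept" if proposal["cost"] <= budget else "over_budget"


def run_episode(seed, scientist_policy):
    rng = random.Random(seed)               # Reset: the seed fixes the scenario
    budget = rng.choice([800, 1000, 1200])  # placeholder lab constraint
    history = []
    for round_number in range(1, MAX_ROUNDS + 1):
        proposal = scientist_policy(history)          # Scientist acts
        reply = lab_manager_reply(proposal, budget)   # Lab Manager responds
        history.append((proposal, reply))             # State updates
        if reply == "accept":
            return {"agreed": True, "rounds": round_number, "history": history}
    # Round limit reached without agreement (timeout).
    return {"agreed": False, "rounds": MAX_ROUNDS, "history": history}


def cutting_scientist(history):
    """Toy policy: start at a cost of 1500 and cut 300 after each rejection."""
    return {"cost": 1500 - 300 * len(history)}


result = run_episode(42, cutting_scientist)
```

Because the scenario is drawn from a seeded generator and both policies are deterministic, calling `run_episode` twice with the same seed returns the same transcript.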

Key Design Decision

For the MVP, only the Scientist is trained.

Role	Implementation
Scientist	Trainable LLM policy
Lab Manager	Deterministic rule-based policy with readable responses
Judge	Deterministic rubric engine, with optional LLM explanation layer

This gives stable environment dynamics and clean reward signals for a hackathon setting.

The Three Roles

A. Scientist Agent

The Scientist protects scientific quality. It reasons about essential controls, safe sample-size reductions, valid substitutions, and the minimum viable version of an experiment that still tests the claim.

Action schema:

{
  "action_type": "propose_protocol | revise_protocol | request_info | accept",
  "sample_size": 60,
  "controls": ["vehicle_control", "positive_control"],
  "technique": "WST1",
  "duration_days": 7,
  "required_equipment": ["plate_reader", "incubator"],
  "required_reagents": ["drug_A", "WST1_kit"],
  "questions": ["Do we have a plate reader free this week?"],
  "rationale": "WST1 is an acceptable substitute for MTT in this template"
}
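
A structural check for this schema might look like the sketch below. The allowed action types come from the schema above; which fields are required for which action type is an assumption, and `validate_action` is a hypothetical helper rather than part of the project.

```python
ALLOWED_ACTION_TYPES = {"propose_protocol", "revise_protocol", "request_info", "accept"}

# Assumption: proposals and revisions must carry every protocol field.
PROTOCOL_FIELDS = {
    "sample_size": int,
    "controls": list,
    "technique": str,
    "duration_days": int,
    "required_equipment": list,
    "required_reagents": list,
}


def validate_action(action):
    """Return a list of human-readable problems; an empty list means valid."""
    action_type = action.get("action_type")
    if action_type not in ALLOWED_ACTION_TYPES:
        return [f"unknown action_type: {action_type!r}"]

    errors = []
    if action_type in ("propose_protocol", "revise_protocol"):
        for name, expected in PROTOCOL_FIELDS.items():
            if not isinstance(action.get(name), expected):
                errors.append(f"{name} must be of type {expected.__name__}")
    if action_type == "request_info" and not action.get("questions"):
        errors.append("request_info needs at least one question")
    return errors


proposal = {
    "action_type": "propose_protocol",
    "sample_size": 60,
    "controls": ["vehicle_control", "positive_control"],
    "technique": "WST1",
    "duration_days": 7,
    "required_equipment": ["plate_reader", "incubator"],
    "required_reagents": ["drug_A", "WST1_kit"],
    "rationale": "WST1 is an acceptable substitute for MTT in this template",
}
problems = validate_action(proposal)
```

Returning a list of problems rather than raising lets the environment count an invalid action and still continue the episode.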

B. Lab Manager Agent

The Lab Manager protects feasibility: budget, equipment availability, machine bookings, reagent delivery timelines, and staffing. For the MVP this is a rule-based system (deterministic constraint checker, substitution suggester, cost estimator, booking checker, natural-language response template) to keep environment behavior stable and debuggable.
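
A deterministic constraint checker with a substitution suggester could be sketched as follows. The lab dictionary layout, the flat per-sample cost model, and the name `check_feasibility` are assumptions made for illustration.

```python
def check_feasibility(proposal, lab):
    """Check a proposal against lab constraints and suggest in-stock substitutes."""
    issues = []
    suggestions = {}

    for item in proposal["required_equipment"]:
        if item not in lab["equipment"]:
            issues.append(f"missing equipment: {item}")

    for reagent in proposal["required_reagents"]:
        if reagent not in lab["reagents_in_stock"]:
            issues.append(f"reagent not in stock: {reagent}")
            substitute = lab["substitutes"].get(reagent)
            if substitute in lab["reagents_in_stock"]:
                suggestions[reagent] = substitute

    # Assumption: cost scales linearly with sample size.
    estimated_cost = proposal["sample_size"] * lab["cost_per_sample"]
    if estimated_cost > lab["budget"]:
        issues.append(f"estimated cost {estimated_cost} exceeds budget {lab['budget']}")

    if proposal["duration_days"] > lab["days_available"]:
        issues.append("duration exceeds available days")

    return {
        "feasible": not issues,
        "issues": issues,
        "suggestions": suggestions,
        "estimated_cost": estimated_cost,
    }


lab = {
    "equipment": {"plate_reader", "incubator"},
    "reagents_in_stock": {"drug_A", "WST1_kit"},
    "substitutes": {"MTT_kit": "WST1_kit"},
    "cost_per_sample": 10,
    "budget": 1000,
    "days_available": 10,
}
proposal = {
    "sample_size": 60,
    "duration_days": 7,
    "required_equipment": ["plate_reader", "incubator"],
    "required_reagents": ["drug_A", "MTT_kit"],
}
report = check_feasibility(proposal, lab)
```

Here the plan is within budget and time, but the MTT kit is out of stock, so the report flags one issue and offers the WST1 kit as a substitute. A response template can then turn `issues` and `suggestions` into a readable reply.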

C. Judge

The Judge is a rubric-backed scorer, not a free-form LLM.

It receives the original paper, hidden minimum-viable replication spec, final proposed protocol, actual lab constraints, and negotiation transcript. It outputs:

Rigor score
Feasibility score
Fidelity score
Final reward
Audit notes

An optional LLM explanation layer can translate the audit into readable notes for the UI.

Reward Structure

Core Dimensions

Dimension	What It Measures	Examples
Rigor	Did the agent preserve the important science?	Sample size, controls, method validity, statistics, duration
Feasibility	Can this lab actually run the plan?	Budget, equipment availability, stock, timeline, staffing
Fidelity	How close is the plan to the original experiment?	Same technique or valid substitute, same control logic, similar sample size, same study aim

Formula

total_reward = 10 × rigor × feasibility × fidelity
             + efficiency_bonus
             + communication_bonus
             − penalties

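The multiplicative core prevents fake wins: because the three scores are multiplied, a near-zero score on any one dimension drags the whole product down. A scientifically perfect but impossible plan scores low, and a cheap but scientifically broken plan also scores low.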
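
The formula translates directly into code. The sketch below assumes each core score is normalized to the range 0 to 1, which the formula itself does not state; the example inputs are illustrative values, not measured results.

```python
def total_reward(rigor, feasibility, fidelity,
                 efficiency_bonus=0.0, communication_bonus=0.0, penalties=0.0):
    """Combine the three core scores with bonuses and penalties."""
    for name, score in (("rigor", rigor),
                        ("feasibility", feasibility),
                        ("fidelity", fidelity)):
        # Assumption: each core score is normalized to [0, 1].
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {score}")
    core = 10 * rigor * feasibility * fidelity
    return core + efficiency_bonus + communication_bonus - penalties


perfect_but_impossible = total_reward(rigor=1.0, feasibility=0.1, fidelity=1.0)
cheap_but_broken = total_reward(rigor=0.1, feasibility=1.0, fidelity=0.5)
balanced = total_reward(rigor=0.8, feasibility=0.8, fidelity=0.8)
```

Under these inputs the two lopsided plans score about 1.0 and 0.5, while the balanced plan scores about 5.12, even though no single score in the balanced plan is as high as the best score in either lopsided plan.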

Penalties

Penalties are applied for timeouts, exceeding the budget, invalid structure, missing critical controls, and bad substitutions.

Reinforcement Learning

RL improves the Scientist policy.

Environment resets.
Scientist generates an action.
Lab Manager replies.
Episode ends with a reward.
Training loop adjusts the Scientist toward higher-reward behaviors.

Target behaviors over training:

Ask better questions before committing.
Preserve critical controls.
Choose realistic substitutions.
Reach agreement faster.
Avoid over-budget plans.

TRL supports OpenEnv-style training through a custom rollout_func for stepping through an environment with environment-computed rewards. GRPO supports custom reward functions. Unsloth provides GRPO notebooks designed for this kind of training.
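
To show how episode rewards become a training signal, the sketch below computes group-relative advantages, the idea at the core of GRPO: each rollout is scored relative to the other rollouts sampled for the same scenario. This is a standalone illustration, not TRL's or Unsloth's implementation; the function name, the `eps` value, and the use of the population standard deviation are assumptions, and real implementations may normalize differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout earns the same reward.
    return [(reward - mu) / (sigma + eps) for reward in rewards]


# Illustrative rewards for three rollouts of the same seeded scenario.
advantages = group_relative_advantages([2.0, 4.0, 6.0])
```

Rollouts that beat the group average get a positive advantage and are reinforced; those below it get a negative one. If every rollout earns the same reward, all advantages are zero and the group contributes no gradient signal.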

Self-Improvement

For the MVP, self-improvement means the Scientist gets measurably better through repeated episodes. That is sufficient for the track.

Stretch goals (time permitting):

Curriculum learning — Easy scenarios first, then medium, then hard.
Self-critique — After a failed episode, the agent reviews a short audit and retries.
Self-play — Train both Scientist and Lab Manager.

World Modeling and Long-Horizon Planning

World Modeling

The agent must build an internal model of a hidden world: what the lab has, what it lacks, what is booked, what is scientifically critical, what is flexible, and how choices affect future feasibility. None of this is fully visible, so the agent infers the world through negotiation.

Long-Horizon Planning

The best move is rarely the first move. A strong Scientist follows a chain: understand the paper goal, ask what is available, propose a first plan, revise after constraints surface, trade off cost against rigor, and reach agreement before timeout. That is multi-step planning, not a single answer.

Constraint System

Constraints come from a scenario generator. Each scenario template defines required equipment, optional substitutes, must-keep controls, minimum sample size, minimum duration, typical costs, and likely bottlenecks. Difficulty modifies them:

Difficulty	Description
Easy	Lab has most of what is needed.
Medium	Some missing items, tighter budget, tighter time.
Hard	Major shortages, bigger tradeoffs, booking conflicts.

For the MVP, the world is deterministic within each episode: the initial seed defines the entire scenario, resources change only through agent choices, and there are no random surprise events. Because the same seed always yields the same scenario, runs are easy to debug, replay, and demo.
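
A seeded scenario generator along these lines might look like the sketch below. The template contents, the difficulty modifiers, and all numeric values are placeholders chosen for illustration, and `generate_scenario` is a hypothetical name.

```python
import random

# Hypothetical template and difficulty modifiers; all values are placeholders.
TEMPLATE = {
    "required_equipment": ["plate_reader", "incubator", "centrifuge"],
    "min_sample_size": 30,
    "typical_cost": 1000,
}
DIFFICULTY = {
    "easy": {"budget_factor": 1.2, "missing_items": 0},
    "medium": {"budget_factor": 0.9, "missing_items": 1},
    "hard": {"budget_factor": 0.6, "missing_items": 2},
}


def generate_scenario(seed, difficulty):
    """Derive one lab scenario from a seed and a difficulty level."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    mods = DIFFICULTY[difficulty]
    missing = rng.sample(TEMPLATE["required_equipment"], mods["missing_items"])
    available = [item for item in TEMPLATE["required_equipment"] if item not in missing]
    return {
        "seed": seed,
        "difficulty": difficulty,
        "available_equipment": available,
        "missing_equipment": sorted(missing),
        "budget": round(TEMPLATE["typical_cost"] * mods["budget_factor"]),
        "min_sample_size": TEMPLATE["min_sample_size"],
    }


hard_lab = generate_scenario(seed=42, difficulty="hard")
```

All randomness is consumed at generation time, so nothing random happens once the episode starts, which matches the deterministic-within-episode design.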

Interface Design

Layout

Section	Content
Left Panel	Original paper summary, challenge label, seed, round counter
Middle Panel	Negotiation log (Scientist in blue, Lab Manager in green, Judge audit at end)
Right Panel	Current proposed protocol, lab inventory snapshot, budget bar, score bars for rigor/feasibility/fidelity
Bottom Controls	New episode, seed selector, scenario selector, replay slider, before-vs-after training toggle

Implementation

Demo UI: Custom React + Vite app hitting the FastAPI + WebSocket backend.
Fallback UI: OpenEnv built-in /web interface.

Folder Structure

replicalab/
├── README.md
├── pyproject.toml
├── openenv.yaml
├── .dockerignore
├── replicalab/
│   ├── __init__.py
│   ├── models.py
│   ├── client.py
│   ├── prompts/
│   │   ├── scientist.txt
│   │   ├── lab_manager.txt
│   │   └── judge.txt
│   ├── scenarios/
│   │   ├── templates.py
│   │   ├── cell_biology.py
│   │   ├── ml_benchmark.py
│   │   └── behavioral_psych.py
│   ├── scoring/
│   │   ├── rubric.py
│   │   ├── rigor.py
│   │   ├── feasibility.py
│   │   └── fidelity.py
│   ├── agents/
│   │   ├── scientist_policy.py
│   │   ├── lab_manager_policy.py
│   │   └── judge_policy.py
│   ├── env/
│   │   └── replicalab_env.py
│   └── utils/
│       ├── seed.py
│       ├── validation.py
│       └── logging.py
├── server/
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── package.json
│   ├── vite.config.ts
│   └── src/
│       ├── App.tsx
│       ├── components/
│       └── pages/
├── notebooks/
│   └── train_colab.ipynb
└── tests/
    ├── test_env.py
    ├── test_reward.py
    ├── test_scenarios.py
    └── test_server.py

Toolchain

Tool	Purpose
OpenEnv 0.2.1	Environment class and server
Hugging Face Spaces	Public hosting (Docker SDK, port 7860)
Docker	Packaging server + frontend
Google Colab	Required training notebook
TRL / Unsloth	RL training on the Scientist
FastAPI + WebSocket	Live environment serving
React + Vite	Frontend
Tailwind + shadcn/ui	Styling
Matplotlib	Reward curves in Colab
CSV / JSONL logs	Replay and debugging
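
For the JSONL logs, one record per line is enough to support replay. The sketch below uses only the standard library; the record fields and helper names are hypothetical, and an in-memory stream stands in for a log file.

```python
import io
import json


def write_episode_log(stream, records):
    """Write one JSON object per line."""
    for record in records:
        stream.write(json.dumps(record, sort_keys=True) + "\n")


def read_episode_log(stream):
    """Read records back, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


# Illustrative records, not real run data.
records = [
    {"seed": 42, "round": 1, "role": "scientist", "action_type": "request_info"},
    {"seed": 42, "round": 1, "role": "lab_manager", "reply": "plate reader is free"},
]
buffer = io.StringIO()
write_episode_log(buffer, records)
buffer.seek(0)
replayed = read_episode_log(buffer)
```

Since each line is a complete JSON object, a log can be appended to during an episode and read back partially if a run is interrupted.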

Scope

In Scope (MVP)

One working OpenEnv environment
Three scenario templates (Cell Biology, ML Benchmark, Behavioral Psychology)
Trainable Scientist agent
Rule-based Lab Manager
Judge rubric engine
Reward logging
HF Space deployment
Colab RL notebook with reward curve
Public repo
One-minute YouTube demo
Clean README
React UI or polished /web fallback

Stretch (Only If Ahead)

LLM Lab Manager
Live replay mode
Side-by-side before-vs-after comparison
More scenario families
Judge explanation LLM
Curriculum learning

Out of Scope

Proving a real paper is true or false
Parsing arbitrary papers from the internet
Full autonomous lab automation
Real wet-lab execution
Full multi-model self-play
Enterprise workflow integrations

Team Roles (4 People)

Person	Ownership
P1: Environment + Reward	Scenario engine, environment state, constraint logic, reward logic, tests
P2: RL + Model	Scientist policy prompt, TRL/Unsloth notebook, rollout loop, reward curves, before/after evaluation
P3: Backend + Deploy	FastAPI, WebSocket, Docker, HF Space, logging, replay API
P4: Frontend + Story	React/Vite UI, visualization, demo flow, README, YouTube demo

Everyone shares bug fixing, testing, and final polish.

Build Sequence

Freeze the environment schema
Implement one scenario end to end
Add reward and logs
Add rule-based Lab Manager
Add Scientist baseline
Connect Colab training
Add React UI
Deploy to HF
Record demo
Write README

Judging Criteria and Demo Strategy

Criterion (Weight)	How ReplicaLab Scores
Environment Innovation (40%)	Partially observable, multi-role scientific negotiation world, not a toy chat task.
Storytelling (30%)	Scientist vs. Lab Manager is instantly understandable.
Training Improvement (20%)	Same seed, before training vs. after training, visible reward improvement.
Pipeline Setup (10%)	Clean reward formula, structured logs, reproducible Colab notebook.

Demo Flow

New episode with a specific seed.
Paper appears, Scientist proposes.
Lab Manager pushes back.
Negotiation unfolds over rounds.
Judge shows final scores.
Replay same seed with the trained model.
Trained model asks smarter questions, avoids bad substitutions, earns higher reward.

Success Metrics

Metric	Untrained Scientist	Trained Scientist
Average reward	Lower	Higher
Rounds to agreement	More	Fewer
Invalid action rate	Higher	Lower
Agreement rate	Lower	Higher
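
These four metrics can be computed from per-episode records. In the sketch below the record fields (`reward`, `agreed`, `rounds`, `actions`, `invalid_actions`) and the function name are assumptions, and the sample episodes are made-up values that only exercise the arithmetic.

```python
def summarize(episodes):
    """Aggregate the four success metrics over a list of episode records."""
    count = len(episodes)
    agreed = [episode for episode in episodes if episode["agreed"]]
    total_actions = sum(episode["actions"] for episode in episodes)
    invalid_actions = sum(episode["invalid_actions"] for episode in episodes)
    return {
        "average_reward": sum(episode["reward"] for episode in episodes) / count,
        "agreement_rate": len(agreed) / count,
        # Only episodes that reached agreement count toward this average.
        "rounds_to_agreement": (
            sum(episode["rounds"] for episode in agreed) / len(agreed) if agreed else None
        ),
        "invalid_action_rate": invalid_actions / total_actions if total_actions else 0.0,
    }


# Made-up episodes for illustration only.
sample = [
    {"reward": 2.0, "agreed": False, "rounds": 6, "actions": 6, "invalid_actions": 2},
    {"reward": 6.0, "agreed": True, "rounds": 4, "actions": 4, "invalid_actions": 0},
]
summary = summarize(sample)
```

Running `summarize` on the same seeds before and after training gives the two columns of the table.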

Sponsor Alignment

Target	Rationale
Halluminate	True multi-actor environment with different beliefs and information per role.
Snorkel AI	Simulated experts in the loop; the Scientist learns by interacting with expert-style roles.
Fleet AI (alternate)	Judge as an explicit oversight layer monitoring and explaining the two agents.

Real-World Applications

Target users: Biotech teams, pharma R&D groups, contract research organizations, university labs, cloud lab platforms, AI labs training scientific agents.

Potential revenue paths: Enterprise experiment planning software, evaluation benchmark licensing, simulation API access, experiment design copilot products.

The Simple Explanation

Imagine two kids want to bake a cake. One knows the recipe. The other knows what is in the kitchen. The recipe kid says they need eggs, milk, flour, and chocolate. The kitchen kid says there is no chocolate, but there is cocoa. They talk and make the best cake they can. If the cake stays tasty, uses what the kitchen has, and finishes on time, they earn a star.

ReplicaLab is that, but for science.