replicalab / ReplicaLab_Master_Blueprint.md

maxxie114

Initial HF Spaces deployment

80d8c84 1 day ago

ReplicaLab Master Blueprint

1. Executive summary

ReplicaLab is an OpenEnv based scientific replication environment.

In each episode, the system creates:

An original experiment or paper summary
A lab with real constraints such as budget, equipment, reagent stock, staffing, and time
A negotiation task where a Scientist agent and a Lab Manager agent must agree on a valid replication plan

The core idea is simple:

One agent knows what the science needs. One agent knows what the lab can actually do. They must negotiate a replication plan that is scientifically valid and realistically feasible.

This becomes a true environment because it has state, actions, observations, transitions, rewards, and episode termination. It is not just a chatbot prompt. It is a structured, trainable world.

2. The real world problem we are targeting

ReplicaLab targets the gap between ideal scientific protocols and real lab constraints.

In the real world, many experiments are hard to replicate because:

Papers describe ideal methods
Labs lack the full equipment or materials
Budgets and schedules are limited
Some substitutions are acceptable, but some break the science
Teams must decide what is essential and what can change

So the real question ReplicaLab asks is:

How do we adapt an experiment without breaking the science?

This is the practical version of the replication crisis problem.

3. One line pitch

ReplicaLab is an OpenEnv environment where a Scientist agent and a Lab Manager agent negotiate how to replicate scientific experiments under realistic lab constraints, and RL trains the Scientist to make better replication decisions over time.

4. Which hackathon tracks we are following

ReplicaLab touches 4 out of the 5 hackathon problem statements.

4.1 Primary tracks

A. Multi Agent Interactions

This is the strongest fit.

Why:

The Scientist and Lab Manager hold different private information
Neither can solve the task alone
They must negotiate, exchange information, and converge

B. World Modeling, Professional Tasks

This is the second strongest fit.

Why:

The environment simulates a real scientific workflow
The agent must reason inside a partially observable professional world
It must infer what the lab can and cannot do before making a good plan

4.2 Supporting tracks

C. Long Horizon Planning and Instruction Following

Why:

The task takes several rounds
The agent must ask, revise, recover from mistakes, and plan ahead
The main reward is delayed until the episode ends, by agreement or by timeout

D. Self Improvement

Why:

The same environment is used for RL training
The Scientist improves across repeated episodes
The environment supports curriculum and replay later on

4.3 Track summary

Tracks touched technically: 4

Tracks we should lead with in the pitch: 2

Multi Agent Interactions
World Modeling

Tracks we should mention as supporting evidence:

Long Horizon Planning
Self Improvement

5. Sponsor and partner alignment

5.1 Best sponsor fits

Halluminate

Best fit because ReplicaLab is a true multi actor environment.

The Scientist is one actor
The Lab Manager is another actor
The Judge can later act as a third oversight actor

Snorkel AI

Best fit because ReplicaLab behaves like simulated experts in the loop.

The Scientist acts like a domain expert
The Lab Manager acts like an operations expert
The learning model improves through repeated expert style interactions

5.2 Good optional fit

Fleet AI

This becomes stronger if the Judge is framed as an oversight agent that monitors, explains, and audits the decisions of the Scientist and Lab Manager.

5.3 Resource fit

Hugging Face for Spaces deployment and credits
Unsloth for RL notebooks and simpler training setup
Northflank for H100 access if faster training is needed
Cursor for coding speed only

6. Why this is truly an environment

ReplicaLab is an environment because it contains the full RL loop.

6.1 State

The state contains:

The paper or experiment description
The hidden minimum viable replication spec
The lab constraints
The round number
The negotiation history
The current proposed protocol
The current score state
Whether the episode is done
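As a minimal sketch, the state fields listed above could be grouped in one dataclass. The class name `EpisodeState`, the field names, and the field types are illustrative assumptions; the blueprint does not fix a schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class EpisodeState:
    """Hypothetical container for the state fields listed in 6.1."""

    paper: dict[str, Any]            # paper or experiment description
    hidden_spec: dict[str, Any]      # hidden minimum viable replication spec
    lab_constraints: dict[str, Any]  # budget, equipment, stock, staffing, time
    round_number: int = 0
    # default_factory gives every episode its own list and dict
    history: list[dict[str, Any]] = field(default_factory=list)
    current_protocol: Optional[dict[str, Any]] = None
    scores: dict[str, float] = field(default_factory=dict)
    done: bool = False


# Example values are made up for illustration.
state = EpisodeState(
    paper={"title": "Drug effect on cell proliferation"},
    hidden_spec={"critical_controls": ["vehicle_control"]},
    lab_constraints={"budget": 5000},
)
```

Using `default_factory` for the mutable fields keeps one episode's negotiation history from leaking into another.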

6.2 Actions

The Scientist can:

Propose a protocol
Revise a protocol
Request information
Accept

The Lab Manager can:

Report feasibility
Suggest alternatives
Reject
Accept
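One way to encode the two action lists above is an enum plus a per-role permission table. This is a sketch; the identifiers `ActionType`, `ALLOWED_ACTIONS`, and `is_allowed`, and the string values, are assumptions made for illustration.

```python
from enum import Enum


class ActionType(str, Enum):
    PROPOSE = "propose_protocol"
    REVISE = "revise_protocol"
    REQUEST_INFO = "request_info"
    REPORT_FEASIBILITY = "report_feasibility"
    SUGGEST_ALTERNATIVE = "suggest_alternative"
    REJECT = "reject"
    ACCEPT = "accept"


# Which role may take which action, following the two lists in 6.2.
ALLOWED_ACTIONS = {
    "scientist": {
        ActionType.PROPOSE,
        ActionType.REVISE,
        ActionType.REQUEST_INFO,
        ActionType.ACCEPT,
    },
    "lab_manager": {
        ActionType.REPORT_FEASIBILITY,
        ActionType.SUGGEST_ALTERNATIVE,
        ActionType.REJECT,
        ActionType.ACCEPT,
    },
}


def is_allowed(role: str, action: ActionType) -> bool:
    """Return True if `role` is permitted to take `action`."""
    return action in ALLOWED_ACTIONS.get(role, set())
```

A check like this gives the environment a single place to detect invalid actions before they reach the transition logic.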

6.3 Observations

Each role sees a different view of the world.

The Scientist sees scientific requirements and negotiation state.

The Lab Manager sees operational constraints and negotiation state.
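The role-specific views could be derived from the full state by two small projection functions. This is a sketch under assumptions: the dictionary keys are hypothetical, and leaving the paper out of the Lab Manager view is one possible reading of the description above.

```python
from typing import Any

# Keys shared by both roles: the negotiation state.
SHARED_KEYS = ("round_number", "history", "current_protocol")


def scientist_view(state: dict[str, Any]) -> dict[str, Any]:
    """Scientific requirements plus the shared negotiation state."""
    view = {key: state[key] for key in SHARED_KEYS}
    view["paper"] = state["paper"]
    return view


def lab_manager_view(state: dict[str, Any]) -> dict[str, Any]:
    """Operational constraints plus the shared negotiation state."""
    view = {key: state[key] for key in SHARED_KEYS}
    view["lab_constraints"] = state["lab_constraints"]
    return view


# Made-up full state; the hidden spec is never copied into either view.
full_state = {
    "paper": {"title": "Drug effect on cell proliferation"},
    "hidden_spec": {"critical_controls": ["vehicle_control"]},
    "lab_constraints": {"budget": 5000},
    "round_number": 1,
    "history": [],
    "current_protocol": None,
}
```

Building each view from an explicit list of allowed keys means a new state field stays hidden from both roles until someone deliberately exposes it.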

6.4 Transitions

Each step updates:

The conversation history
The current protocol
The round counter
Budget usage if needed
The done status if agreement happens or time runs out

6.5 Reward

The environment returns a score based on:

Scientific rigor
Feasibility
Fidelity to the original experiment

That is what makes it a trainable environment instead of a static task.

7. The core environment loop

7.1 One episode

reset(seed=42) creates a paper, a lab context, and a hidden evaluation rubric
The Scientist receives its observation
The Lab Manager receives its observation
The Scientist acts first
The Lab Manager responds
This repeats for up to a fixed number of rounds
If both accept, the episode ends successfully
If time runs out, the episode ends with a penalty
The Judge computes the final reward
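The turn order above can be sketched as a plain loop over two policy callables. The names `run_episode`, `toy_scientist`, and `toy_lab_manager`, the action dictionaries, and the default of 6 rounds are all assumptions for illustration. Judge scoring is left out and would run on the returned history.

```python
from typing import Callable

# A policy takes the history so far and returns one action.
Policy = Callable[[list[dict]], dict]


def run_episode(scientist: Policy, lab_manager: Policy, max_rounds: int = 6) -> dict:
    """Alternate Scientist and Lab Manager turns until both accept or rounds run out."""
    history: list[dict] = []
    for round_number in range(1, max_rounds + 1):
        sci_action = scientist(history)  # the Scientist acts first
        history.append({"role": "scientist", **sci_action})
        mgr_action = lab_manager(history)  # the Lab Manager responds
        history.append({"role": "lab_manager", **mgr_action})
        if sci_action["type"] == "accept" and mgr_action["type"] == "accept":
            return {"agreed": True, "rounds": round_number, "history": history}
    # Timeout: the caller would apply the timeout penalty here.
    return {"agreed": False, "rounds": max_rounds, "history": history}


def toy_scientist(history: list[dict]) -> dict:
    # Accept once the Lab Manager has reported the plan feasible; otherwise propose.
    if history and history[-1].get("feasible"):
        return {"type": "accept"}
    return {"type": "propose"}


def toy_lab_manager(history: list[dict]) -> dict:
    # Mirror an accept from the Scientist; otherwise report the plan as feasible.
    if history[-1]["type"] == "accept":
        return {"type": "accept"}
    return {"type": "report_feasibility", "feasible": True}


result = run_episode(toy_scientist, toy_lab_manager)
```

With these toy policies the episode ends in agreement in the second round. A Scientist that never accepts runs into the round limit instead.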

7.2 Environment methods

The environment should implement:

reset()
step()
state()
close()

These are the core methods that make the system compatible with OpenEnv serving and RL rollouts.
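A plain-Python skeleton with these four methods might look like the sketch below. It deliberately does not import or subclass anything from OpenEnv, because the blueprint does not show the OpenEnv base classes; the class name, the return shapes, and the rule that a single accept ends the episode are simplifying assumptions.

```python
class ReplicaLabEnvSketch:
    """Illustrative skeleton only; not wired to the OpenEnv interfaces."""

    def __init__(self, max_rounds: int = 6) -> None:
        self.max_rounds = max_rounds
        self._state: dict = {}

    def reset(self, seed: int = 0) -> dict:
        # A real reset would also build the paper, lab, and hidden rubric from the seed.
        self._state = {"seed": seed, "round_number": 0, "history": [], "done": False}
        return self.state()

    def step(self, action: dict) -> tuple[dict, float, bool]:
        if not self._state or self._state["done"]:
            raise RuntimeError("call reset() before step()")
        self._state["history"].append(action)
        self._state["round_number"] += 1
        agreed = action.get("type") == "accept"  # simplification: one accept ends the episode
        timed_out = self._state["round_number"] >= self.max_rounds
        self._state["done"] = agreed or timed_out
        reward = 0.0  # placeholder; the Judge would compute the real reward
        return self.state(), reward, self._state["done"]

    def state(self) -> dict:
        return dict(self._state)  # shallow copy of the current state

    def close(self) -> None:
        self._state = {}


env = ReplicaLabEnvSketch(max_rounds=2)
first_obs = env.reset(seed=42)
```

Calling `step()` twice on this instance ends the episode by timeout, after which `reset()` is required before stepping again.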

8. Scenario environments inside ReplicaLab

For the MVP, we should use 3 scenario families.

8.1 MVP scenario families

A. Cell Biology

Example: Drug effect on cell proliferation using MTT or WST1 style assay

Why it is good:

Easy to explain
Has obvious lab constraints
Clear tradeoffs between rigor and feasibility

B. Machine Learning Benchmark Replication

Example: Reproducing a benchmark result with limited GPU budget and compute time

Why it is good:

Easier to simulate
Good for judges who understand ML
Strong world modeling story around compute, time, and reproducibility

C. Behavioral Psychology Survey Study

Example: Replicating a survey study with participant limits, time limits, and platform constraints

Why it is good:

Gives variety beyond wet lab work
Shows broader scientific replication use case
Easy to explain ethical and logistical constraints later on

8.2 Stretch scenario families

Biochemistry
Materials Science
Chemistry

9. How each model interacts with the others

9.1 Scientist agent

Role: Protect scientific validity

Knows:

The paper goal
Important methodological elements
Hidden scientific priorities through the environment design
The negotiation history

Does not directly know:

Full budget
Full inventory
Full equipment schedule
Full staffing details

Main job: Design a protocol that still counts as a meaningful replication.

9.2 Lab Manager agent

Role: Protect operational feasibility

Knows:

Budget
Equipment availability
Booking conflicts
Reagent stock
Personnel constraints
Safety restrictions
The negotiation history

Does not directly know:

Which scientific elements are absolutely critical
Which substitutions are scientifically acceptable unless told

Main job: Tell the Scientist what is actually possible and suggest realistic alternatives.

9.3 Judge agent

Role: Audit the final plan and score it

Knows:

Original paper summary
Minimum viable replication rubric
Final protocol
Actual constraints
Full conversation history

Main job: Compute the final reward and optionally explain it in plain English.

10. How the agents should be implemented

10.1 MVP implementation choice

For the hackathon MVP:

Scientist should be the only trained LLM policy
Lab Manager should be rule based and deterministic
Judge should be a deterministic rubric engine with optional LLM explanation

This is the safest and most realistic build path.

10.2 Why only one agent should be trained first

It reduces instability
It makes reward improvement easier to show
It makes the environment more deterministic and judge friendly
It gives a clean before versus after story

10.3 Scientist creation

The Scientist can be built from a small instruct model with structured JSON output.

The prompt should instruct it to:

Protect scientific validity
Ask for missing information before committing
Output only valid schema fields
Avoid invalid or impossible protocols
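To enforce "output only valid schema fields", the environment could parse the model's raw text and reject anything outside the schema. This is a sketch; the field names in `ALLOWED_FIELDS` and the type strings in `ALLOWED_TYPES` are hypothetical, since the blueprint does not define the schema.

```python
import json

ALLOWED_FIELDS = {"type", "protocol", "question", "rationale"}  # hypothetical schema
ALLOWED_TYPES = {"propose", "revise", "request_info", "accept"}


def parse_scientist_action(raw: str) -> dict:
    """Parse model output; raise ValueError on anything outside the schema."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    unknown = set(action) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    action_type = action.get("type")
    if not isinstance(action_type, str) or action_type not in ALLOWED_TYPES:
        raise ValueError(f"invalid action type: {action_type!r}")
    return action
```

A `ValueError` raised here is a natural place to count an invalid action and apply the invalid action penalty.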

10.4 Lab Manager creation

The Lab Manager should be implemented as a deterministic policy layer that:

Checks budget
Checks equipment availability
Checks stock and restock timing
Checks staff limits
Returns templated natural language plus structured feasibility data
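The checks above can be sketched as one deterministic function that returns both structured data and a templated message. The function name, the protocol and lab dictionary keys, and the message wording are assumptions for illustration.

```python
def check_feasibility(protocol: dict, lab: dict) -> dict:
    """Run budget, equipment, stock, and staffing checks in a fixed order."""
    issues = []
    if protocol["cost"] > lab["budget"]:
        issues.append(f"cost {protocol['cost']} exceeds budget {lab['budget']}")
    for item in protocol["equipment"]:
        if item not in lab["available_equipment"]:
            issues.append(f"equipment unavailable: {item}")
    for reagent, needed in protocol["reagents"].items():
        in_stock = lab["stock"].get(reagent, 0)
        restock_days = lab["restock_days"].get(reagent)  # None means it cannot be restocked
        too_late = restock_days is None or restock_days > protocol["duration_days"]
        if in_stock < needed and too_late:
            issues.append(f"reagent short: {reagent}")
    if protocol["staff_needed"] > lab["staff_available"]:
        issues.append("not enough staff")
    feasible = not issues
    if feasible:
        message = "Plan is feasible."
    else:
        message = "Plan is not feasible: " + "; ".join(issues) + "."
    return {"feasible": feasible, "issues": issues, "message": message}


# Made-up lab and protocols for illustration.
lab = {
    "budget": 1000,
    "available_equipment": {"plate_reader"},
    "stock": {"MTT": 2},
    "restock_days": {"MTT": 3},
    "staff_available": 2,
}
ok_plan = {"cost": 800, "equipment": ["plate_reader"], "reagents": {"MTT": 1},
           "duration_days": 7, "staff_needed": 1}
bad_plan = {"cost": 1200, "equipment": ["flow_cytometer"], "reagents": {"MTT": 5},
            "duration_days": 2, "staff_needed": 3}
```

Because the checks run in a fixed order with no randomness, the same protocol and lab always produce the same reply, which keeps episodes replayable.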

10.5 Judge creation

The Judge should be implemented as:

A rubric based scoring engine
An audit note generator
Optionally, an explanation layer that converts scores into readable comments for the frontend

11. How the judge agent is integrated

The Judge is integrated inside the environment.

It is called:

At the end of the episode for final reward computation
Optionally after each round for intermediate score previews

11.1 What the Judge evaluates

Whether critical controls were preserved
Whether sample size is sufficient
Whether substitutions are scientifically acceptable
Whether the plan fits budget and inventory
Whether the plan is faithful enough to the original design

11.2 What the Judge returns

rigor_score
feasibility_score
fidelity_score
total_reward
judge_notes

11.3 Important design rule

The Judge should not be the entire reward source through free form opinions.

The Judge should primarily be a deterministic rubric engine.

That makes training, replay, and scoring much more stable.

12. Reward structure

The reward should be easy to explain and hard to game.

12.1 Core reward dimensions

A. Rigor

Questions:

Did the final plan preserve critical scientific elements?
Are the controls present?
Is sample size good enough?
Is the technique valid?
Is the study duration acceptable?

B. Feasibility

Questions:

Is the plan within budget?
Is the equipment actually available?
Are the reagents in stock or restockable in time?
Is the timeline realistic?
Is staffing sufficient?

C. Fidelity

Questions:

How close is the proposed protocol to the original experiment?
Did the core method stay intact?
Did the control logic stay intact?
Is the sample size close enough?

12.2 Composite reward

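Use a multiplicative core so that a low score on any one dimension pulls the whole reward down. The agent cannot cheat by maximizing one dimension while ignoring the others.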

base_reward = rigor * feasibility * fidelity * 10
bonus = efficiency_bonus + communication_bonus
penalty = timeout_penalty + invalid_action_penalty + over_budget_penalty
final_reward = base_reward + bonus - penalty

12.3 Why this is good

High rigor but impossible protocol still scores poorly
Cheap but scientifically broken protocol still scores poorly
Fast, thoughtful negotiation gets rewarded
The score is intuitive for judges
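The composite formula in 12.2 translates directly into a small function. In this sketch the component scores are assumed to lie in the range 0 to 1, which the blueprint does not state, and the returned keys follow the names listed in 11.2.

```python
def compute_reward(
    rigor: float,
    feasibility: float,
    fidelity: float,
    efficiency_bonus: float = 0.0,
    communication_bonus: float = 0.0,
    timeout_penalty: float = 0.0,
    invalid_action_penalty: float = 0.0,
    over_budget_penalty: float = 0.0,
) -> dict:
    """Multiplicative core plus additive bonuses and penalties, as in 12.2."""
    base_reward = rigor * feasibility * fidelity * 10
    bonus = efficiency_bonus + communication_bonus
    penalty = timeout_penalty + invalid_action_penalty + over_budget_penalty
    return {
        "rigor_score": rigor,
        "feasibility_score": feasibility,
        "fidelity_score": fidelity,
        "total_reward": base_reward + bonus - penalty,
    }


# Illustrative inputs: a balanced plan versus a rigorous but nearly infeasible one.
balanced = compute_reward(0.8, 0.8, 0.8)    # total is about 5.12
infeasible = compute_reward(1.0, 0.1, 1.0)  # total is about 1.0
```

For comparison, a plain average of the second plan's three scores, scaled by 10, would be 7.0, higher than the same average for the balanced plan (8.0 would need all three at 0.8, so the gap would be small). The product drops the second plan to about 1.0, which is the property that stops the agent from trading feasibility away entirely.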

13. How RL works in ReplicaLab

13.1 Simple explanation

RL works like this:

The Scientist tries an action in the environment
The environment responds through the Lab Manager and Judge logic
The Scientist gets a reward at the end
Training pushes the Scientist toward behaviors that earn higher rewards
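The idea of pushing a policy toward higher-reward behavior can be illustrated with a toy gradient bandit. This is not the TRL or Unsloth training of an LLM policy: it is a swapped-in, much simpler technique over two hand-named strategies, and the strategy names and reward values are invented for illustration.

```python
import math
import random

# Hypothetical fixed rewards for two Scientist strategies (illustrative numbers only).
REWARDS = {"ask_first": 1.0, "propose_blind": 0.2}


def train(steps: int = 500, lr: float = 0.1, seed: int = 0) -> dict:
    """Gradient bandit: raise the preference of actions that beat the average reward."""
    rng = random.Random(seed)
    actions = list(REWARDS)
    prefs = {a: 0.0 for a in actions}
    total_reward = 0.0
    for t in range(steps):
        # Softmax over preferences gives the current action probabilities.
        exps = {a: math.exp(prefs[a]) for a in actions}
        norm = sum(exps.values())
        probs = {a: exps[a] / norm for a in actions}
        chosen = rng.choices(actions, weights=[probs[a] for a in actions])[0]
        reward = REWARDS[chosen]
        baseline = total_reward / t if t else 0.0  # mean reward of earlier steps
        advantage = reward - baseline
        for a in actions:
            indicator = 1.0 if a == chosen else 0.0
            prefs[a] += lr * advantage * (indicator - probs[a])
        total_reward += reward
    return prefs


prefs = train()
```

After training, the preference for `ask_first` ends up above the preference for `propose_blind`, so the policy samples the better strategy more often. The real setup applies the same principle to an LLM's token-level policy.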

13.2 What behavior should improve

Over time, the Scientist should learn to:

Ask better questions before proposing
Avoid impossible protocols
Preserve critical scientific details
Choose better substitutions
Reach agreement faster
Reduce invalid actions

13.3 What model should be trained

For the MVP, train only the Scientist.

That gives the clearest reward curve and the cleanest training narrative.

14. How self improvement works

14.1 MVP self improvement

Self improvement in the MVP simply means:

The Scientist gets better after repeated episodes.

That is enough to satisfy the track.

14.2 Stretch self improvement ideas

Curriculum learning from easy to medium to hard scenarios
Post episode self critique before retry
Later training of both Scientist and Lab Manager
Automatic scenario difficulty scaling

15. How world modeling is being done

World modeling means the agent must reason about a hidden world and update its internal understanding over time.

In ReplicaLab, that world includes:

What equipment exists
What equipment is missing
Which items are booked
What is in stock
What can be substituted
What is scientifically critical
What tradeoffs hurt future feasibility

The Scientist does not see all of this at once.

So it must build a mental model of the lab through dialogue, feedback, and revision.

That is why ReplicaLab fits the world modeling track strongly.

16. How long horizon planning is being done

Long horizon planning appears because the task is multi step.

A good Scientist should:

Understand the experimental goal
Ask for missing constraints
Propose an initial protocol
Revise after operational feedback
Trade off rigor against feasibility
Converge before timeout

This is not one shot generation. It is multi round planning with delayed reward.

17. How constraints work

Constraints come from a seeded scenario generator.

17.2 Difficulty levels

Easy

The lab has most of what is needed.

Medium

The lab is missing some important pieces and requires thoughtful substitutions.

Hard

The lab is missing major pieces and forces serious protocol redesign.

17.3 How constraints should change

For the MVP, keep each episode deterministic once the seed is fixed.

That means:

reset(seed=42) always produces the same paper and constraint world
The world only changes because of the agents’ actions
No random hidden shocks should happen inside an episode yet

This makes tests reproducible and lets any episode be replayed exactly.
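Seed-level determinism is easiest to get by giving the generator its own random number generator instead of using the global one. The sketch below assumes a hypothetical `generate_scenario` function, a made-up equipment list, and a made-up mapping from difficulty to the number of missing items.

```python
import random

EQUIPMENT = ["plate_reader", "incubator", "centrifuge", "biosafety_cabinet"]
# Hypothetical mapping: how many equipment items are removed from the lab.
DIFFICULTY_MISSING = {"easy": 0, "medium": 1, "hard": 2}


def generate_scenario(seed: int, difficulty: str = "easy") -> dict:
    """Build the same lab for the same seed and difficulty, every time."""
    rng = random.Random(seed)  # local RNG, independent of global random state
    budget = rng.randrange(2000, 10001, 500)
    missing = rng.sample(EQUIPMENT, DIFFICULTY_MISSING[difficulty])
    return {
        "seed": seed,
        "difficulty": difficulty,
        "budget": budget,
        "available_equipment": [e for e in EQUIPMENT if e not in missing],
    }


first = generate_scenario(42, "hard")
random.random()  # unrelated use of the global RNG does not affect the result
second = generate_scenario(42, "hard")
```

Here `first` and `second` are equal, which is the property that lets a before versus after demo replay the same seed with different Scientist policies.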

18. What the end result should be

The end result is not a full system that proves whether a paper is true or false.

The end result should be:

A working OpenEnv environment
A trained Scientist agent
A stable Lab Manager policy
A Judge rubric engine
A public Hugging Face Space
A training notebook that shows reward improvement
A visual demo that clearly shows untrained versus trained behavior

The final result we are aiming for is:

a trainable benchmark and demo for scientific replication planning under constraints

19. What the interface should look like

19.1 Frontend choice

React + Vite is the right choice.

It is faster and cleaner than trying to build a full Cursor style IDE interface.

19.2 UI layout

Left panel

Original paper summary
Key scientific requirements
Seed
Scenario type
Round counter

Middle panel

Negotiation log
Scientist messages in blue
Lab Manager messages in green
Judge summary at the end

Right panel

Current proposed protocol
Budget bar
Inventory summary
Score bars for rigor, feasibility, and fidelity
Final composite score

Bottom controls

New episode
Seed selector
Scenario selector
Replay slider
Before versus after training toggle

19.3 Fallback option

If the custom UI slips, use the OpenEnv web interface as a fallback and polish only the essential display panels.

20. Architecture overview

flowchart TD
    A[Scenario Templates] --> B[Scenario Engine]
    B --> C[ReplicaLabEnv]
    C --> D[Scientist Policy]
    C --> E[Lab Manager Policy]
    C --> F[Judge Rubric Engine]
    D --> C
    E --> C
    F --> G[Step Result and Logs]
    C --> G
    G --> H[FastAPI and WebSocket Server]
    H --> I[React Vite Frontend]
    H --> J[Colab Training Client]
    J --> K[TRL or Unsloth RL Training]
    K --> L[Reward Curves and Evaluation]

21. How exactly we are using the hackathon tools

21.1 OpenEnv 0.2.1

Used for:

Defining the environment interface
Creating the stateful RL world
Serving the environment over FastAPI and WebSocket
Enabling clients to connect locally or remotely

21.2 Hugging Face Spaces

Used for:

Public deployment
Judge accessible demo hosting
Satisfying the official submission requirement

21.3 Docker

Used for:

Packaging the backend and optional frontend
Ensuring the app runs on port 7860 in HF Spaces

21.4 Colab

Used for:

The required minimal training script
Running rollouts against the environment
Plotting reward improvement

21.5 TRL or Unsloth

Used for:

Training the Scientist policy
Applying RL against the environment reward
Producing visible reward curves and before versus after behavior

21.6 Matplotlib

Used for:

Reward curve visualization
Component score plots
Training summary charts

21.7 GitHub

Used for:

Public source code
README
Notebook storage
Architecture documentation

21.8 YouTube

Used for:

The one minute demo video required by the hackathon

22. Scope of work

22.1 In scope for the hackathon MVP

OpenEnv environment implementation
3 scenario families
Scientist as the trainable policy
Rule based Lab Manager
Deterministic Judge rubric engine
FastAPI and WebSocket server
Docker deployment
Hugging Face Space
Colab training notebook
Reward curve
React Vite frontend or clean fallback UI
Public GitHub repo
Demo video
README

22.2 Stretch scope if ahead of schedule

LLM based Lab Manager
Judge explanation LLM
Live replay mode
Before versus after split screen
More scientific domains
Difficulty curriculum

22.3 Out of scope

Proving a real paper is factually true or false
Full autonomous laboratory automation
Real wet lab execution
Arbitrary paper ingestion from the internet
Full self play between multiple LLM agents
Complex enterprise integrations unrelated to the core demo

23. Folder structure

replicalab/
├── README.md
├── pyproject.toml
├── openenv.yaml
├── .dockerignore
├── replicalab/
│   ├── __init__.py
│   ├── models.py
│   ├── client.py
│   ├── prompts/
│   │   ├── scientist.txt
│   │   ├── lab_manager.txt
│   │   └── judge.txt
│   ├── scenarios/
│   │   ├── templates.py
│   │   ├── cell_biology.py
│   │   ├── ml_benchmark.py
│   │   └── behavioral_psych.py
│   ├── scoring/
│   │   ├── rubric.py
│   │   ├── rigor.py
│   │   ├── feasibility.py
│   │   └── fidelity.py
│   ├── agents/
│   │   ├── scientist_policy.py
│   │   ├── lab_manager_policy.py
│   │   └── judge_policy.py
│   ├── env/
│   │   └── replicalab_env.py
│   ├── utils/
│   │   ├── seed.py
│   │   ├── validation.py
│   │   └── logging.py
│   └── outputs/
│       ├── logs/
│       ├── replays/
│       └── plots/
├── server/
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── package.json
│   ├── vite.config.ts
│   └── src/
│       ├── App.tsx
│       ├── components/
│       └── pages/
├── notebooks/
│   └── train_colab.ipynb
└── tests/
    ├── test_env.py
    ├── test_reward.py
    ├── test_scenarios.py
    └── test_server.py

24. How the judges are likely to judge the project

The hackathon judging criteria emphasize:

Environment innovation
Storytelling
Training improvement
Reward and pipeline coherence

24.1 Why ReplicaLab scores well

Environment Innovation

Strong because this is a partially observable scientific negotiation world, not a toy single prompt task.

Storytelling

Strong because the Scientist versus Lab Manager framing is intuitive and memorable.

Training Improvement

Strong because the Scientist can visibly improve through RL and reward curves.

Reward and Pipeline Coherence

Strong because the scoring dimensions are simple and explainable.

24.2 Ideal judge demo flow

Show the problem in one sentence
Start a seeded episode
Show the paper and lab constraints
Show the back and forth negotiation
Show the score breakdown
Replay the same seed with the trained Scientist
Show higher reward and better decision quality

25. Completion rate expectations

25.1 Project completion reality

With a focused 4 person team, we should aim to complete:

90 percent of the judge critical MVP

Even if that is only around 60 percent of the full dream vision, that is completely fine.

25.2 Environment success metrics

Track these metrics:

Average reward
Agreement rate
Average rounds to agreement
Invalid action rate
Reward by scenario difficulty

A strong demo should show:

Higher reward after training
Higher agreement rate after training
Fewer invalid proposals after training
Faster convergence after training
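These metrics can be computed from per-episode log records. The sketch below assumes a hypothetical record format with the keys `reward`, `agreed`, `rounds`, `num_actions`, `invalid_actions`, and `difficulty`; the sample values are made up.

```python
from collections import defaultdict
from statistics import mean


def summarize(episodes: list[dict]) -> dict:
    """Aggregate the metrics from 25.2 over a non-empty list of episode records."""
    agreed = [e for e in episodes if e["agreed"]]
    total_actions = sum(e["num_actions"] for e in episodes)
    by_difficulty = defaultdict(list)
    for e in episodes:
        by_difficulty[e["difficulty"]].append(e["reward"])
    return {
        "average_reward": mean(e["reward"] for e in episodes),
        "agreement_rate": len(agreed) / len(episodes),
        # Only episodes that reached agreement count toward this average.
        "average_rounds_to_agreement": mean(e["rounds"] for e in agreed) if agreed else None,
        "invalid_action_rate": (
            sum(e["invalid_actions"] for e in episodes) / total_actions
            if total_actions else 0.0
        ),
        "reward_by_difficulty": {d: mean(r) for d, r in by_difficulty.items()},
    }


episodes = [
    {"reward": 2.0, "agreed": False, "rounds": 6, "num_actions": 6,
     "invalid_actions": 2, "difficulty": "hard"},
    {"reward": 6.0, "agreed": True, "rounds": 4, "num_actions": 4,
     "invalid_actions": 0, "difficulty": "easy"},
    {"reward": 4.0, "agreed": True, "rounds": 2, "num_actions": 2,
     "invalid_actions": 1, "difficulty": "easy"},
]
summary = summarize(episodes)
```

Running the same function over episodes from the untrained and the trained Scientist gives the before versus after comparison directly.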

26. Team split for 4 people

Person 1: Environment and scoring owner

Owns:

Scenario generation
Environment state and transitions
Constraint system
Reward logic
Tests

Person 2: RL and model owner

Owns:

Scientist prompts and action schema
Training notebook
TRL or Unsloth integration
Reward curves
Before versus after evaluation

Person 3: Backend and deployment owner

Owns:

FastAPI server
WebSocket protocol
Docker image
HF Spaces deployment
Logs and replay endpoints

Person 4: Frontend and story owner

Owns:

React Vite UI
Visual score panels
Demo polish
README
One minute YouTube demo

27. Workflow for the team

27.1 Build order

Freeze environment schema and reward structure
Build one scenario end to end
Add deterministic Lab Manager
Add Judge rubric engine
Connect FastAPI and WebSocket serving
Add basic frontend
Add Colab training notebook
Deploy to HF Space
Add remaining scenarios
Record demo and finish README

27.2 Runtime workflow

User starts a new episode
The environment generates a seeded paper and lab
The Scientist receives its observation
The Lab Manager receives its observation
The Scientist proposes or asks a question
The Lab Manager replies with feasibility data
The environment updates state
The Judge computes intermediate or final scores
The episode ends on agreement or timeout
The replay is stored for demo and evaluation

28. Revenue model

This is not needed for judging, but it is useful for investor or product framing.

28.1 Possible revenue paths

A. Enterprise experiment planning assistant

Sell a planning and auditing tool to biotech and research organizations.

B. Scientific AI benchmark licensing

Offer ReplicaLab as a benchmark for labs or AI teams evaluating scientific agents.

C. Simulation API

Charge for API access to scenarios, scoring, and replay infrastructure.

D. Workflow software expansion

Expand later into experiment design, lab operations support, and protocol adaptation copilots.

29. Five year old explanation

Imagine two kids want to bake a cake.

One kid knows the recipe
One kid knows what is inside the kitchen

The recipe kid says, “We need chocolate.”

The kitchen kid says, “We do not have chocolate, but we have cocoa.”

Then they talk until they find the best cake they can make.

If the cake still tastes good, uses what the kitchen has, and finishes on time, they get a star.

ReplicaLab is that, but for science experiments.

30. Final recommended positioning

30.1 Best main pitch

ReplicaLab is an OpenEnv scientific negotiation environment where a Scientist agent and a Lab Manager agent collaborate to design valid experiment replications under real world lab constraints. We train the Scientist with RL so it learns to ask better questions, make better tradeoffs, and reach better replication plans over time.

30.2 Best track framing

Primary: Multi Agent Interactions and World Modeling

Supporting: Long Horizon Planning and Self Improvement

30.3 Best sponsor framing

Primary sponsor fit: Halluminate and Snorkel AI

Optional supporting narrative: Fleet AI through the Judge as an oversight layer

30.4 Best MVP framing

Train only the Scientist
Keep the Lab Manager rule based
Keep the Judge rubric based
Ship 3 scenario families
Show one strong before versus after training demo

31. Final “done” definition

ReplicaLab is done for the hackathon when we have:

A working OpenEnv environment
A deployed HF Space on port 7860
A public GitHub repo
A Colab notebook with visible reward improvement
A one minute YouTube demo
A clear README
A clean story that judges understand in under one minute

That is the real finish line.
