# ReplicaLab

**A multi-agent scientific replication environment built on OpenEnv**

---

## Overview

ReplicaLab is a virtual scientific replication world. Each episode generates an original experiment and a constrained lab, then two agents negotiate a replication plan:

- A **Scientist** agent that protects scientific validity.
- A **Lab Manager** agent that protects cost, equipment, time, staffing, and feasibility.

They negotiate over multiple rounds. If they converge on a sound, feasible protocol, the episode yields a high reward. If they fail, overspend, or strip away critical scientific elements, the reward stays low.

The real-world motivation is the **replication crisis**: published protocols describe ideal conditions, but real labs face missing tools, tight budgets, booking conflicts, reagent shortages, and limited personnel. ReplicaLab trains an agent to answer a single question:

> *How do we adapt an experiment without breaking the science?*

---

## Hackathon Track Alignment

ReplicaLab touches four of the five OpenEnv Hackathon problem statements.

### Primary Tracks

| Track | Fit |
|---|---|
| **Multi-Agent Interactions** | Two roles hold different private information and must negotiate toward consensus. Strongest fit. |
| **World Modeling (Professional)** | The agent reasons inside a professional world with hidden constraints. Very strong fit. |

### Supporting Tracks

| Track | Fit |
|---|---|
| **Long-Horizon Planning** | The agent must ask, revise, recover, and converge across multiple rounds rather than solving in one step. |
| **Self-Improvement** | The same environment trains the Scientist so its behavior improves over repeated episodes. |

**Demo framing:** Lead with Multi-Agent + World Modeling. Support with Long-Horizon + Self-Improvement.

---

## Why This Is an Environment

ReplicaLab is not a prompt. It satisfies all five properties of a proper environment:

1. **State** – Current paper, lab constraints, round number, negotiation history, proposed protocol, spent budget, remaining stock, done flag.
2. **Actions** – The Scientist can propose, revise, ask questions, or accept. The Lab Manager can report feasibility, suggest substitutions, reject, or accept.
3. **Transitions** – Each action mutates the world: budget consumed, protocol updated, round counter incremented, dialogue history extended.
4. **Observations** – Each role sees a different partial view of the world (partially observable).
5. **Reward** – The environment scores the quality of the final plan.

OpenEnv provides exactly this pattern: typed `Action`, `Observation`, and `State` models with `reset()`, `step()`, and `state()` methods, wrapped in FastAPI + WebSocket serving with per-session instances.
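
That pattern can be sketched with plain dataclasses. This is a minimal sketch, not the real OpenEnv base classes: `ReplicaLabEnv`, the field names, and the round cap are all illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    actor: str          # "scientist" or "lab_manager"
    action_type: str    # e.g. "propose_protocol", "accept"
    payload: dict = field(default_factory=dict)

@dataclass
class Observation:
    view: dict          # role-specific partial view of the world
    done: bool = False
    reward: float = 0.0

@dataclass
class State:
    round: int = 0
    budget_spent: float = 0.0
    protocol: dict = field(default_factory=dict)
    history: list = field(default_factory=list)
    done: bool = False

class ReplicaLabEnv:
    MAX_ROUNDS = 8  # illustrative round limit

    def reset(self, seed: int = 0) -> Observation:
        # Seed fully determines the paper, lab, and rubric for the episode.
        self._state = State()
        self._seed = seed
        return Observation(view={"seed": seed, "round": 0})

    def step(self, action: Action) -> Observation:
        # Each action mutates the world: history grows, the protocol
        # updates, and the round counter advances toward the cap.
        s = self._state
        s.history.append((action.actor, action.action_type))
        if action.action_type in ("propose_protocol", "revise_protocol"):
            s.protocol = action.payload
        s.round += 1
        s.done = action.action_type == "accept" or s.round >= self.MAX_ROUNDS
        return Observation(view={"round": s.round}, done=s.done)

    def state(self) -> State:
        return self._state
```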

---

## Episode Lifecycle

A single episode unfolds as follows:

1. **Reset** – `reset(seed=42)` creates one paper template, one lab constraint set, and one hidden evaluation rubric.
2. **Scientist observes** – Paper summary, experiment goal, conversation history, current proposed protocol.
3. **Lab Manager observes** – Budget, equipment, booking calendar, reagents, staff, safety rules, current proposal.
4. **Scientist acts** – Proposes, revises, asks, or accepts.
5. **Lab Manager responds** – Reports feasibility, suggests substitutions, or accepts.
6. **State updates** – The environment transitions.
7. **Repeat** – Continue until both sides accept or the round limit forces a timeout.
8. **Reward returned** – The environment scores the final protocol.
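
The alternation above can be sketched as a driver loop. `scientist_policy`, `lab_manager_policy`, and the message dictionaries are placeholders, not the final interfaces:

```python
def run_episode(scientist_policy, lab_manager_policy, max_rounds=8):
    """Alternate the two roles until both accept or the rounds run out."""
    history, protocol = [], None
    for _ in range(max_rounds):
        sci = scientist_policy(history, protocol)       # step 4: Scientist acts
        history.append(("scientist", sci))
        if sci["action_type"] in ("propose_protocol", "revise_protocol"):
            protocol = sci.get("protocol", protocol)
        mgr = lab_manager_policy(history, protocol)     # step 5: Lab Manager responds
        history.append(("lab_manager", mgr))
        if sci["action_type"] == "accept" and mgr["action_type"] == "accept":
            return protocol, history, "agreed"          # converged before timeout
    return protocol, history, "timeout"                 # round limit reached
```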

### Key Design Decision

For the MVP, only the **Scientist is trained**.

| Role | Implementation |
|---|---|
| **Scientist** | Trainable LLM policy |
| **Lab Manager** | Deterministic rule-based policy with readable responses |
| **Judge** | Deterministic rubric engine, with optional LLM explanation layer |

This gives stable environment dynamics and clean reward signals for a hackathon setting.

---

## The Three Roles

### A. Scientist Agent

The Scientist protects scientific quality. It reasons about essential controls, safe sample-size reductions, valid substitutions, and the minimum viable version of an experiment that still tests the claim.

**Action schema:**

```json
{
  "action_type": "propose_protocol | revise_protocol | request_info | accept",
  "sample_size": 60,
  "controls": ["vehicle_control", "positive_control"],
  "technique": "WST1",
  "duration_days": 7,
  "required_equipment": ["plate_reader", "incubator"],
  "required_reagents": ["drug_A", "WST1_kit"],
  "questions": ["Do we have a plate reader free this week?"],
  "rationale": "WST1 is an acceptable substitute for MTT in this template"
}
```

### B. Lab Manager Agent

The Lab Manager protects feasibility: budget, equipment availability, machine bookings, reagent delivery timelines, and staffing. For the MVP this is a rule-based system (deterministic constraint checker, substitution suggester, cost estimator, booking checker, natural-language response template) to keep environment behavior stable and debuggable.
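
A minimal sketch of that checker; the inventory layout, substitution table, and cost figures below are illustrative, not the real scenario data:

```python
# Valid swaps per template; hypothetical example entries.
SUBSTITUTES = {"MTT_kit": ["WST1_kit"]}

def check_feasibility(proposal, lab):
    """Deterministic constraint check: equipment, reagents, budget."""
    issues, suggestions = [], []
    for item in proposal.get("required_equipment", []):
        if item not in lab["equipment"]:
            issues.append(f"missing equipment: {item}")
    for reagent in proposal.get("required_reagents", []):
        if reagent in lab["reagents"]:
            continue
        subs = [s for s in SUBSTITUTES.get(reagent, []) if s in lab["reagents"]]
        if subs:
            suggestions.append(f"substitute {subs[0]} for {reagent}")
        else:
            issues.append(f"missing reagent: {reagent}")
    cost = sum(lab["costs"].get(r, 0) for r in proposal.get("required_reagents", []))
    if cost > lab["budget"]:
        issues.append(f"over budget: {cost} > {lab['budget']}")
    return {"feasible": not issues, "issues": issues, "suggestions": suggestions}
```

The natural-language response template can then render `issues` and `suggestions` into the readable replies the Scientist sees.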

### C. Judge

The Judge is a **rubric-backed scorer**, not a free-form LLM.

It receives the original paper, hidden minimum-viable replication spec, final proposed protocol, actual lab constraints, and negotiation transcript. It outputs:

- Rigor score
- Feasibility score
- Fidelity score
- Final reward
- Audit notes

An optional LLM explanation layer can translate the audit into readable notes for the UI.
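
A minimal sketch of such a rubric engine. The hidden-spec field names (`min_sample_size`, `must_keep_controls`, `accepted_techniques`) and the thresholds are illustrative, not the real rubric:

```python
def judge(protocol, spec):
    """Score a final protocol against a hidden minimum-viable spec."""
    # Rigor: scale by sample size, but zero out if a critical control is gone.
    rigor = min(1.0, protocol["sample_size"] / spec["min_sample_size"])
    if not set(spec["must_keep_controls"]) <= set(protocol["controls"]):
        rigor = 0.0
    # Fidelity: same technique or a listed valid substitute.
    fidelity = 1.0 if protocol["technique"] in spec["accepted_techniques"] else 0.3
    # Feasibility: the plan must fit the lab's budget.
    feasibility = 1.0 if protocol["cost"] <= spec["budget"] else 0.0
    audit = [f"rigor={rigor:.2f}", f"feasibility={feasibility:.2f}",
             f"fidelity={fidelity:.2f}"]
    return {"rigor": rigor, "feasibility": feasibility, "fidelity": fidelity,
            "reward": 10 * rigor * feasibility * fidelity, "audit": audit}
```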

---

## Reward Structure

### Core Dimensions

| Dimension | What It Measures | Examples |
|---|---|---|
| **Rigor** | Did the agent preserve the important science? | Sample size, controls, method validity, statistics, duration |
| **Feasibility** | Can this lab actually run the plan? | Budget, equipment availability, stock, timeline, staffing |
| **Fidelity** | How close is the plan to the original experiment? | Same technique or valid substitute, same control logic, similar sample size, same study aim |

### Formula

```
total_reward = 10 * rigor * feasibility * fidelity
             + efficiency_bonus
             + communication_bonus
             - penalties
```

The multiplicative core prevents fake wins: a scientifically perfect but impossible plan scores low, and a cheap but scientifically broken plan also scores low.

### Penalties

Applied for timeout, exceeding budget, invalid structure, missing critical controls, and bad substitutions.
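
The formula itself is a one-liner; as a sketch, with each dimension assumed to lie in [0, 1]:

```python
def total_reward(rigor, feasibility, fidelity,
                 efficiency_bonus=0.0, communication_bonus=0.0, penalties=0.0):
    """Multiplicative core plus additive bonuses minus penalties.

    Any dimension at zero zeroes the entire base reward, which is
    what blocks the fake wins described above.
    """
    core = 10 * rigor * feasibility * fidelity
    return core + efficiency_bonus + communication_bonus - penalties
```

A scientifically perfect plan with `feasibility=0.0` keeps only its bonuses, so it cannot outscore a balanced plan.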

---

## Reinforcement Learning

RL improves the **Scientist policy**.

1. Environment resets.
2. Scientist generates an action.
3. Lab Manager replies.
4. Episode ends with a reward.
5. Training loop adjusts the Scientist toward higher-reward behaviors.

**Target behaviors over training:**

- Ask better questions before committing.
- Preserve critical controls.
- Choose realistic substitutions.
- Reach agreement faster.
- Avoid over-budget plans.

TRL supports OpenEnv-style training through a custom `rollout_func` for stepping through an environment with environment-computed rewards. GRPO supports custom reward functions. Unsloth provides GRPO notebooks designed for this kind of training.
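
As a generic sketch of that loop (this is plain Python, not the actual TRL `rollout_func` signature; `collect_rollouts`, `StubEnv`, and the step-return shape are hypothetical):

```python
import random

def collect_rollouts(env_factory, policy, n_episodes, seed=0):
    """Run episodes and pair each transcript with its final reward."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n_episodes):
        env = env_factory(seed=rng.randrange(10_000))
        obs, done, reward, transcript = env.reset(), False, 0.0, []
        while not done:
            action = policy(obs)           # the trainable Scientist acts
            transcript.append(action)
            obs, reward, done = env.step(action)
        batch.append({"transcript": transcript, "reward": reward})
    return batch

class StubEnv:
    """Two-step stand-in for the real environment client."""
    def __init__(self, seed):
        self.turn = 0
    def reset(self):
        return "paper summary"
    def step(self, action):
        self.turn += 1
        done = self.turn >= 2
        return "updated view", (1.0 if action == "accept" else 0.0), done
```

The trainer then weights each transcript by its environment-computed reward, nudging the Scientist toward the target behaviors listed above.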

---

## Self-Improvement

For the MVP, self-improvement means the Scientist gets measurably better through repeated episodes. That is sufficient for the track.

**Stretch goals (time permitting):**

- **Curriculum learning** – Easy scenarios first, then medium, then hard.
- **Self-critique** – After a failed episode, the agent reviews a short audit and retries.
- **Self-play** – Train both Scientist and Lab Manager.

---

## World Modeling and Long-Horizon Planning

### World Modeling

The agent must build an internal model of a hidden world: what the lab has, what it lacks, what is booked, what is scientifically critical, what is flexible, and how choices affect future feasibility. None of this is fully visible, so the agent infers the world through negotiation.

### Long-Horizon Planning

The best move is rarely the first move. A strong Scientist follows a chain: understand the paper goal, ask what is available, propose a first plan, revise after constraints surface, trade off cost against rigor, and reach agreement before timeout. That is multi-step planning, not a single answer.

---

## Constraint System

Constraints come from a **scenario generator**. Each scenario template defines required equipment, optional substitutes, must-keep controls, minimum sample size, minimum duration, typical costs, and likely bottlenecks. Difficulty modifies them:

| Difficulty | Description |
|---|---|
| **Easy** | Lab has most of what is needed. |
| **Medium** | Some missing items, tighter budget, tighter time. |
| **Hard** | Major shortages, bigger tradeoffs, booking conflicts. |

For the MVP, the world is **deterministic within each episode**: the initial seed defines the entire scenario, resources change only through agent choices, and there are no random surprise events. This makes debugging, replay, and demo presentations much stronger.
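
A seeded generator makes that determinism concrete. The equipment names, budget range, and stocked-fraction table below are illustrative:

```python
import random

# Fraction of required items the lab has in stock, by difficulty (illustrative).
DIFFICULTY = {"easy": 0.9, "medium": 0.6, "hard": 0.3}

def generate_scenario(seed, difficulty="medium",
                      required=("plate_reader", "incubator", "centrifuge", "hood")):
    """Seed plus difficulty fully determine the scenario: same inputs,
    same lab, every time."""
    rng = random.Random(f"{seed}-{difficulty}")  # str seed keeps it reproducible
    keep = max(1, round(len(required) * DIFFICULTY[difficulty]))
    equipment = sorted(rng.sample(list(required), keep))
    budget = rng.randrange(800, 2001, 100)
    return {"equipment": equipment, "budget": budget, "difficulty": difficulty}
```

Replaying a demo is then just calling `generate_scenario` with the same seed.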

---

## Interface Design

### Layout

| Section | Content |
|---|---|
| **Left Panel** | Original paper summary, challenge label, seed, round counter |
| **Middle Panel** | Negotiation log (Scientist in blue, Lab Manager in green, Judge audit at end) |
| **Right Panel** | Current proposed protocol, lab inventory snapshot, budget bar, score bars for rigor/feasibility/fidelity |
| **Bottom Controls** | New episode, seed selector, scenario selector, replay slider, before-vs-after training toggle |

### Implementation

- **Demo UI:** Custom React + Vite app hitting the FastAPI + WebSocket backend.
- **Fallback UI:** OpenEnv built-in `/web` interface.

---

## Folder Structure

```
replicalab/
├── README.md
├── pyproject.toml
├── openenv.yaml
├── .dockerignore
├── replicalab/
│   ├── __init__.py
│   ├── models.py
│   ├── client.py
│   ├── prompts/
│   │   ├── scientist.txt
│   │   ├── lab_manager.txt
│   │   └── judge.txt
│   ├── scenarios/
│   │   ├── templates.py
│   │   ├── cell_biology.py
│   │   ├── ml_benchmark.py
│   │   └── behavioral_psych.py
│   ├── scoring/
│   │   ├── rubric.py
│   │   ├── rigor.py
│   │   ├── feasibility.py
│   │   └── fidelity.py
│   ├── agents/
│   │   ├── scientist_policy.py
│   │   ├── lab_manager_policy.py
│   │   └── judge_policy.py
│   ├── env/
│   │   └── replicalab_env.py
│   └── utils/
│       ├── seed.py
│       ├── validation.py
│       └── logging.py
├── server/
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── package.json
│   ├── vite.config.ts
│   └── src/
│       ├── App.tsx
│       ├── components/
│       └── pages/
├── notebooks/
│   └── train_colab.ipynb
└── tests/
    ├── test_env.py
    ├── test_reward.py
    ├── test_scenarios.py
    └── test_server.py
```

---

## Toolchain

| Tool | Purpose |
|---|---|
| **OpenEnv 0.2.1** | Environment class and server |
| **Hugging Face Spaces** | Public hosting (Docker SDK, port 7860) |
| **Docker** | Packaging server + frontend |
| **Google Colab** | Required training notebook |
| **TRL / Unsloth** | RL training on the Scientist |
| **FastAPI + WebSocket** | Live environment serving |
| **React + Vite** | Frontend |
| **Tailwind + shadcn/ui** | Styling |
| **Matplotlib** | Reward curves in Colab |
| **CSV / JSONL logs** | Replay and debugging |

---

## Scope

### In Scope (MVP)

1. One working OpenEnv environment
2. Three scenario templates (Cell Biology, ML Benchmark, Behavioral Psychology)
3. Trainable Scientist agent
4. Rule-based Lab Manager
5. Judge rubric engine
6. Reward logging
7. HF Space deployment
8. Colab RL notebook with reward curve
9. Public repo
10. One-minute YouTube demo
11. Clean README
12. React UI or polished `/web` fallback

### Stretch (Only If Ahead)

- LLM Lab Manager
- Live replay mode
- Side-by-side before-vs-after comparison
- More scenario families
- Judge explanation LLM
- Curriculum learning

### Out of Scope

- Proving a real paper is true or false
- Parsing arbitrary papers from the internet
- Full autonomous lab automation
- Real wet-lab execution
- Full multi-model self-play
- Enterprise workflow integrations

---

## Team Roles (4 People)

| Person | Ownership |
|---|---|
| **P1: Environment + Reward** | Scenario engine, environment state, constraint logic, reward logic, tests |
| **P2: RL + Model** | Scientist policy prompt, TRL/Unsloth notebook, rollout loop, reward curves, before/after evaluation |
| **P3: Backend + Deploy** | FastAPI, WebSocket, Docker, HF Space, logging, replay API |
| **P4: Frontend + Story** | React/Vite UI, visualization, demo flow, README, YouTube demo |

Everyone shares bug fixing, testing, and final polish.

---

## Build Sequence

1. Freeze the environment schema
2. Implement one scenario end to end
3. Add reward and logs
4. Add rule-based Lab Manager
5. Add Scientist baseline
6. Connect Colab training
7. Add React UI
8. Deploy to HF
9. Record demo
10. Write README

---

## Judging Criteria and Demo Strategy

| Criterion (Weight) | How ReplicaLab Scores |
|---|---|
| **Environment Innovation (40%)** | Partially observable, multi-role scientific negotiation world, not a toy chat task. |
| **Storytelling (30%)** | Scientist vs. Lab Manager is instantly understandable. |
| **Training Improvement (20%)** | Same seed, before training vs. after training, visible reward improvement. |
| **Pipeline Setup (10%)** | Clean reward formula, structured logs, reproducible Colab notebook. |

### Demo Flow

1. New episode with a specific seed.
2. Paper appears, Scientist proposes.
3. Lab Manager pushes back.
4. Negotiation unfolds over rounds.
5. Judge shows final scores.
6. Replay same seed with the trained model.
7. Trained model asks smarter questions, avoids bad substitutions, earns higher reward.

---

## Success Metrics

| Metric | Untrained Scientist | Trained Scientist |
|---|---|---|
| Average reward | Lower | Higher |
| Rounds to agreement | More | Fewer |
| Invalid action rate | Higher | Lower |
| Agreement rate | Lower | Higher |

---

## Sponsor Alignment

| Target | Rationale |
|---|---|
| **Halluminate** | True multi-actor environment with different beliefs and information per role. |
| **Snorkel AI** | Simulated experts in the loop; the Scientist learns by interacting with expert-style roles. |
| **Fleet AI** (alternate) | Judge as an explicit oversight layer monitoring and explaining the two agents. |

---

## Real-World Applications

**Target users:** Biotech teams, pharma R&D groups, contract research organizations, university labs, cloud lab platforms, AI labs training scientific agents.

**Potential revenue paths:** Enterprise experiment planning software, evaluation benchmark licensing, simulation API access, experiment design copilot products.

---

## The Simple Explanation

Imagine two kids want to bake a cake. One knows the **recipe**. The other knows what is in the **kitchen**. The recipe kid says they need eggs, milk, flour, and chocolate. The kitchen kid says there is no chocolate, but there is cocoa. They talk and make the best cake they can. If the cake stays tasty, uses what the kitchen has, and finishes on time, they earn a star.

ReplicaLab is that, but for science.