---
title: ReplicaLab
emoji: "πŸ§ͺ"
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---

# ReplicaLab

**A multi-agent constraint-aware planning environment built on [OpenEnv](https://github.com/openenv)**

> *Over 70% of landmark studies fail to replicate. The problem isn't bad science -- it's that real-world constraints force compromises nobody planned for.*

ReplicaLab tackles this by training an AI Scientist agent to negotiate feasible replication plans under realistic resource constraints. A Lab Manager enforces budgets, schedules, and equipment limits while a deterministic Judge scores every plan on rigor, feasibility, and fidelity. Through reinforcement learning, the Scientist learns to ask better questions, make smarter tradeoffs, and reach agreement faster -- all without sacrificing scientific quality.

Three scenario families ship today -- mathematics reasoning, ML benchmark replication, and offline finance/trading backtest design -- each with easy, medium, and hard difficulty scaling. Physics and biology adapters are planned once the core normalized scenario layer is stable.

## Team Ownership

| Owner | Current focus |
|------|----------------|
| Kian (Person A) | Shared schemas, validation, scenario engine, judge logic |
| Ayush (Person B) | Scientist prompting and parsing, notebook and client path |
| Max (Person C) | Server, deployment, and runtime plumbing |
| Kush (Person D) | Frontend, UI polish, docs, and demo assets |

---

## Architecture

<p align="center">
  <img src="./ReplicaLab_Architecture_Final.svg" alt="ReplicaLab Final System Architecture" width="100%"/>
</p>

ReplicaLab uses a **hybrid Oracle architecture**:

- The **Oracle layer** is optional and powers world-building and narrative intelligence:
  - richer scenario generation
  - optional event injection
  - optional model-backed Lab Manager narration
  - optional post-mortem analysis
- The **deterministic core** remains canonical for RL:
  - environment transitions
  - validation
  - grounded Lab Manager feasibility
  - judge scoring and reward math

This satisfies the sponsor-facing β€œmodel-driven environment intelligence” direction without making reward noisy or irreproducible.

---

## How It Works

Each episode simulates a negotiation between two agents inside a constrained technical scenario:

| Role | Type | Responsibility |
|------|------|----------------|
| **Scientist** | Trainable model policy | Proposes plans, asks questions, and preserves objective quality |
| **Lab Manager** | Hybrid model-backed policy with deterministic grounding | Negotiates revisions while the checker enforces feasibility and constraint truth |
| **Judge** | Deterministic rubric engine | Scores the final plan on rigor, feasibility, fidelity, and parsimony |
| **Oracle (optional)** | Frontier-model intelligence layer | Generates richer worlds, optional events, optional live LM narration, and post-mortem analysis |

### Episode Lifecycle

1. **Reset**: `reset(seed)` builds a normalized scenario pack and hidden reference spec.
2. **Scientist observes**: task summary, goal, history, and current plan.
3. **Lab Manager observes**: resource, scheduling, staffing, and policy constraints from the same normalized pack.
4. **Negotiation**: multiple rounds of proposals, counteroffers, and questions.
5. **Agreement or timeout**: both accept, or the round limit is reached.
6. **Reward**: the deterministic judge scores the final plan.
7. **Optional Oracle overlays**: event injection, round commentary, and post-mortem may be layered on top without replacing deterministic reward.
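The lifecycle above can be sketched with a toy stand-in. Everything here is illustrative: the class, field names, and reward value are assumptions, not the real `ReplicaLabEnv` API.

```python
from dataclasses import dataclass

@dataclass
class StubEnv:
    """Toy stand-in for the real environment; names and rewards are illustrative."""
    round_limit: int = 6
    round: int = 0
    agreed: bool = False

    def reset(self, seed: int) -> dict:
        # Step 1: build a (stubbed) scenario from the seed.
        self.round, self.agreed = 0, False
        return {"task_summary": f"scenario-{seed}", "current_plan": None}

    def step(self, action: dict):
        # Steps 4-6: negotiate until agreement or the round limit, then score.
        self.round += 1
        if action.get("type") == "accept":
            self.agreed = True
        done = self.agreed or self.round >= self.round_limit
        reward = 7.1 if (done and self.agreed) else 0.0  # judge runs only at the end
        return {"round": self.round}, reward, done

env = StubEnv()
obs = env.reset(seed=42)
done, reward = False, 0.0
while not done:
    # Propose for two rounds, then accept (a real policy would decide this).
    action = {"type": "propose"} if env.round < 2 else {"type": "accept"}
    obs, reward, done = env.step(action)
```

The key structural point is that reward is emitted once, at episode end, rather than per step.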

### Reward Formula

```text
total_reward = 10 * rigor * feasibility * fidelity * parsimony
             + efficiency_bonus
             + communication_bonus
             - penalties
```

The multiplicative core prevents fake wins: a theoretically strong but impossible plan scores low, and a cheap but invalid plan also scores low. Even when the Oracle layer is enabled, this deterministic path remains canonical for RL training and before/after evaluation.
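A direct transcription of the formula makes the "no fake wins" property concrete. Parameter names follow the rubric above; the zero defaults for the bonus and penalty terms are assumptions.

```python
def total_reward(rigor, feasibility, fidelity, parsimony,
                 efficiency_bonus=0.0, communication_bonus=0.0, penalties=0.0):
    """Multiplicative core plus additive bonuses, mirroring the rubric above."""
    return (10 * rigor * feasibility * fidelity * parsimony
            + efficiency_bonus + communication_bonus
            - penalties)

# Any zero dimension collapses the multiplicative core:
# a rigorous but infeasible plan earns only its bonuses.
```

For example, `total_reward(0.9, 0.0, 0.9, 1.0)` is `0.0` despite strong rigor, because feasibility zeroes the product.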

### Internal Normalization Rule

The outer action and observation models stay stable. Domain-specific content is converted into a normalized scenario pack first, then mapped into the current `ScientistObservation` and `LabManagerObservation` contracts. Prompts are assembled from that normalized data rather than hard-coded per domain.
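A minimal sketch of that mapping follows. The pack fields match the normalized scenario pack described in the Scenario System section; the observation field names are assumptions, not the real `ScientistObservation` / `LabManagerObservation` schemas.

```python
def to_observations(pack: dict) -> tuple[dict, dict]:
    """Split one normalized scenario pack into role-specific views.
    Output field names are illustrative, not the real observation contracts."""
    scientist = {
        "task_summary": pack["task_summary"],
        "success_criteria": pack["success_criteria"],
        # hidden_reference_spec is deliberately withheld from both roles
    }
    lab_manager = {
        "constraints": pack["constraints"],
        "resources": pack["resources"],
        "allowed_substitutions": pack["allowed_substitutions"],
    }
    return scientist, lab_manager

pack = {
    "task_summary": "Replicate TinyBERT on AG News",
    "success_criteria": ["accuracy within 1 point"],
    "constraints": {"gpu_hours": 10},
    "resources": ["V100"],
    "allowed_substitutions": {"A100": "V100"},
    "hidden_reference_spec": {"target_accuracy": 0.92},
}
sci, mgr = to_observations(pack)
```

Because prompts are built from these normalized views, adding a new domain means writing a new pack generator, not new observation models.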

---

## Getting Started

### Prerequisites

- Python 3.10+
- Node.js 18+
- Docker (optional, for containerized deployment)

### Option 1: Local Development

```bash
git clone https://github.com/Ayush10/replicalab-ai.git
cd replicalab-ai

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -e ".[dev]"
```

Start the backend:

```bash
python -m server.app
```

The server starts at `http://localhost:7860`. Visit `/web` for the built-in fallback UI, or start the full React frontend:

```bash
cd frontend && npm install && npm run dev
```

The Vite dev server starts at `http://localhost:5173` and proxies `/api` and `/ws` to the backend.

### Option 2: Production Build (Single Server)

```bash
cd frontend && npm install && npm run build && cd ..
python -m server.app
```

Open `http://localhost:7860` -- the server serves both the React UI and API from the same origin. Client-side routes (`/episode`, `/compare`) are handled by an SPA catch-all route.

### Option 3: Docker

```bash
docker build -t replicalab .
docker run -p 7860:7860 replicalab
```

### Option 4: Google Colab

Open `notebooks/train_colab.ipynb` in Colab. The first cell installs all dependencies:

```python
!pip install git+https://github.com/Ayush10/replicalab-ai.git
```

Set `REPLICALAB_URL` to the live HF Space or a local server URL to run training episodes.

### Running Tests

```bash
pytest tests/   # 475+ tests
```

### Fallback Demo Path

If the React frontend is unavailable, the server exposes a self-contained HTML interface at `/web` with scenario selection, seed input, step controls, and score display. This works on any browser with no build step required.

---

## Training the Scientist

RL training improves the Scientist agent’s ability to negotiate effective, feasible plans.

### Selected Base Model

- **Primary shared base:** `Qwen/Qwen3.5-9B`
- **Scientist artifact:** `Qwen/Qwen3.5-9B` + Unsloth GRPO LoRA
- **Lab Manager artifact:** `Qwen/Qwen3.5-9B` + Unsloth SFT LoRA
- **Reduced-scale fallback:** `Qwen/Qwen3.5-4B`
- **Audit-only judge candidate:** `Qwen/Qwen3.5-122B-A10B`
- **Decision record:** `docs/agt11_scientist_model_selection.md`
- **Training goals:** `docs/training_goals.md`

### Training Path

1. Use `notebooks/train_minimal_colab.ipynb` as the sponsor-facing minimal Colab script for the Unsloth / HF TRL requirement
2. Use the judged notebook `notebooks/train_colab.ipynb` as the full readable driver
3. Use the reusable training stack under `replicalab/training/`
4. Run heavy jobs on Northflank H100 with `replicalab-train`
5. Save separate Scientist and Lab Manager adapters plus:
   - reward curves
   - component curves
   - paper-understanding and communication metrics
   - before/after evaluation metrics
   - cumulative benchmark history plots across runs
   - replay and plot artifacts

### Training Loop

```text
reset -> Scientist acts -> Lab Manager responds -> ... -> episode ends -> deterministic reward -> policy update
```

### Target Behaviors Over Training

- Ask better questions before committing to a plan
- Understand the paper brief before proposing a protocol
- Preserve critical checks, assumptions, and required steps
- Choose realistic substitutions when preferred resources are unavailable
- Reach agreement in fewer rounds
- Avoid impossible or over-budget plans

---

## Scenario System

Scenarios are generated deterministically from a seed. Each template emits a normalized scenario pack with:

- `task_summary`
- `success_criteria`
- `constraints`
- `resources`
- `allowed_substitutions`
- `hidden_reference_spec`

Difficulty scaling mechanically tightens constraints, removes resources, or adds conflicts; it never changes the outer contract or prompt structure.
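Deterministic generation can be sketched as below. Only the field names come from the pack contract above; the generation logic, budget table, and numeric ranges are invented for illustration.

```python
import random

DIFFICULTY_BUDGET = {"easy": 10, "medium": 8, "hard": 6}  # assumed GPU-hour bases

def generate_pack(seed: int, difficulty: str = "easy") -> dict:
    """Same seed + difficulty -> identical pack (the determinism guarantee)."""
    rng = random.Random(seed)  # all randomness flows through one seeded RNG
    gpu_hours = DIFFICULTY_BUDGET[difficulty] + rng.randint(0, 1)
    return {
        "task_summary": f"ml_benchmark replication (seed={seed}, {difficulty})",
        "success_criteria": ["held-out accuracy within tolerance"],
        "constraints": {"gpu_hours": gpu_hours},
        "resources": ["A100"] if rng.random() < 0.5 else ["V100"],
        "allowed_substitutions": {"A100": "V100"},
        "hidden_reference_spec": {"target_accuracy": 0.92},
    }
```

Routing every random draw through one `random.Random(seed)` instance is what makes `reset(seed)` reproducible across machines.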

| Difficulty | Description |
|------------|-------------|
| **Easy** | Most required resources are present and tradeoffs are light |
| **Medium** | Some missing items, tighter budgets or time, and at least one meaningful conflict |
| **Hard** | Multiple shortages, sharper tradeoffs, and serious scheduling or resource conflicts |

### Included Scenario Templates

| Template | Domain | Example Task |
|----------|--------|--------------|
| `math_reasoning` | Mathematics | Proof planning under tool, review, and time constraints |
| `ml_benchmark` | Machine learning | Model evaluation with dataset, compute, and time constraints |
| `finance_trading` | Finance and trading | Offline strategy and backtest planning under risk and capital limits |

### Scenario Summaries

**Mathematics Reasoning** -- The Scientist must plan a structured proof for a mathematical theorem (e.g. Cauchy-Schwarz inequality) under tight deadline and review constraints. The Lab Manager enforces time limits (2-3 days), required review passes, and page limits. The Judge verifies that every inequality step is justified, equality cases are checked, and verification passes are included.

**ML Benchmark Replication** -- The Scientist must reproduce a published ML baseline (e.g. TinyBERT on AG News or ResNet-18 on CIFAR-10) within a tolerance margin. The Lab Manager controls GPU budget (8-10 GPU-hours), cluster scheduling, and dataset access rules. Tradeoffs include seed count vs. budget and GPU tier vs. fidelity to the original compute setup. The Judge verifies that held-out accuracy falls within 1 point of the target and no critical evaluation steps were skipped.

**Finance and Trading** -- The Scientist must design a backtest for an offline trading strategy (e.g. mean-reversion on equities or momentum on futures). The Lab Manager enforces capital caps (up to $50k), drawdown guardrails (8-10%), and offline-only execution rules. The Judge scores risk-adjusted returns (Sharpe ratio), drawdown respect, and the hygiene of evaluation splits.

---

## Project Structure

```text
replicalab-ai/
β”œβ”€β”€ README.md
β”œβ”€β”€ ReplicaLab_Architecture_Final.svg
β”œβ”€β”€ pyproject.toml
β”œβ”€β”€ openenv.yaml
β”œβ”€β”€ replicalab/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ models.py                # Action, Observation, State schemas
β”‚   β”œβ”€β”€ client.py                # OpenEnv client wrapper
β”‚   β”œβ”€β”€ oracle.py                # Optional frontier-model Oracle wrapper
β”‚   β”œβ”€β”€ oracle_models.py         # Oracle scenario and post-mortem schemas
β”‚   β”œβ”€β”€ cache.py                 # Cached Oracle scenario generation
β”‚   β”œβ”€β”€ prompts/
β”‚   β”‚   β”œβ”€β”€ scientist.txt
β”‚   β”‚   β”œβ”€β”€ lab_manager.txt
β”‚   β”‚   β”œβ”€β”€ judge.txt
β”‚   β”‚   β”œβ”€β”€ oracle_world_architect.txt
β”‚   β”‚   β”œβ”€β”€ oracle_adjudicator.txt
β”‚   β”‚   β”œβ”€β”€ oracle_event_injector.txt
β”‚   β”‚   β”œβ”€β”€ oracle_post_mortem.txt
β”‚   β”‚   └── oracle_lab_manager.txt
β”‚   β”œβ”€β”€ scenarios/
β”‚   β”‚   β”œβ”€β”€ templates.py         # Normalized scenario pack + Oracle adapter
β”‚   β”‚   β”œβ”€β”€ math_reasoning.py
β”‚   β”‚   β”œβ”€β”€ ml_benchmark.py
β”‚   β”‚   └── finance_trading.py
β”‚   β”œβ”€β”€ scoring/
β”‚   β”‚   β”œβ”€β”€ rubric.py            # Canonical deterministic reward math
β”‚   β”‚   β”œβ”€β”€ rigor.py
β”‚   β”‚   β”œβ”€β”€ feasibility.py
β”‚   β”‚   β”œβ”€β”€ fidelity.py
β”‚   β”‚   └── explain.py
β”‚   β”œβ”€β”€ agents/
β”‚   β”‚   β”œβ”€β”€ scientist_policy.py
β”‚   β”‚   β”œβ”€β”€ lab_manager_policy.py
β”‚   β”‚   β”œβ”€β”€ lab_manager_agent.py # Optional model-backed Lab Manager wrapper
β”‚   β”‚   └── judge_policy.py
β”‚   β”œβ”€β”€ env/
β”‚   β”‚   └── replicalab_env.py    # Real env with optional Oracle hooks
β”‚   β”œβ”€β”€ training/
β”‚   β”‚   β”œβ”€β”€ artifacts.py
β”‚   β”‚   β”œβ”€β”€ cli.py
β”‚   β”‚   β”œβ”€β”€ corpus.py
β”‚   β”‚   β”œβ”€β”€ datasets.py
β”‚   β”‚   β”œβ”€β”€ evaluation.py
β”‚   β”‚   β”œβ”€β”€ lab_manager_sft.py
β”‚   β”‚   β”œβ”€β”€ metrics.py
β”‚   β”‚   β”œβ”€β”€ plots.py
β”‚   β”‚   β”œβ”€β”€ rollout.py
β”‚   β”‚   β”œβ”€β”€ runtime.py
β”‚   β”‚   └── scientist_grpo.py
β”‚   └── utils/
β”‚       β”œβ”€β”€ seed.py
β”‚       β”œβ”€β”€ validation.py
β”‚       └── logging.py
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ app.py
β”‚   β”œβ”€β”€ requirements.txt
β”‚   └── Dockerfile
β”œβ”€β”€ frontend/
β”‚   β”œβ”€β”€ package.json
β”‚   β”œβ”€β”€ vite.config.ts
β”‚   β”œβ”€β”€ index.html
β”‚   └── src/
β”‚       β”œβ”€β”€ App.tsx              # Routes, Toast provider, Onboarding
β”‚       β”œβ”€β”€ pages/               # DashboardPage, EpisodePage, ComparePage
β”‚       β”œβ”€β”€ components/          # UI panels, 3D scenes, editor, toasts
β”‚       β”œβ”€β”€ lib/                 # api.ts, audio.ts, confetti.ts, useTheme.ts
β”‚       └── types/               # TypeScript contracts aligned with backend
β”œβ”€β”€ notebooks/
β”‚   β”œβ”€β”€ train_minimal_colab.ipynb
β”‚   └── train_colab.ipynb
└── tests/
    β”œβ”€β”€ test_env.py
    β”œβ”€β”€ test_reward.py
    β”œβ”€β”€ test_scenarios.py
    β”œβ”€β”€ test_oracle.py
    β”œβ”€β”€ test_cache.py
    └── test_server.py
```

---

## Deployment

**Live deployment:** [`https://ayushozha-replicalab.hf.space`](https://ayushozha-replicalab.hf.space)

The app is deployed on HF Spaces with `sdk: docker` on port `7860`. The multi-stage Dockerfile builds the React frontend with Node.js, then serves both the UI and API from a single Python container.

```bash
curl https://ayushozha-replicalab.hf.space/health
# -> {"status":"ok","env":"real","version":"0.1.0"}
```

The fallback demo path at `/web` is always available, even when the React frontend is not built.

---

## Toolchain

| Tool | Purpose |
|------|---------|
| **OpenEnv 0.2.1** | Environment class and server |
| **FastAPI + WebSocket** | Live environment serving |
| **TRL / Unsloth** | RL training (GRPO) |
| **React + Vite** | Frontend |
| **Tailwind + shadcn/ui** | Styling |
| **Docker** | Packaging |
| **Hugging Face Spaces** | Public hosting |
| **Notebook / Colab / Northflank H100** | Training and evaluation |

---

## Results

### What Improved After Training

- **Higher reward**: The trained Scientist achieves 67% higher average reward (4.25 -> 7.10) by learning to preserve rigor while respecting constraints.
- **Faster agreement**: Negotiations converge in 2.8 rounds on average vs. 4.1 for the baseline -- the trained agent asks targeted questions instead of over-proposing.
- **Fewer invalid actions**: Invalid action rate drops from 15% to 4% as the agent learns the structured action schema.

### Evaluation Summary

| Metric | Baseline Scientist | Trained Scientist | Change |
|--------|-------------------:|------------------:|-------:|
| Average reward | 4.25 | 7.10 | +67% |
| Rounds to agreement | 4.1 | 2.8 | -32% |
| Invalid action rate | 15% | 4% | -73% |
| Agreement rate | 50% | 80% | +60% |
| Avg rigor score | 0.55 | 0.72 | +31% |
| Avg feasibility score | 0.52 | 0.78 | +50% |
| Avg fidelity score | 0.58 | 0.71 | +22% |

### Key Takeaways for Judges

1. The multiplicative reward formula means every dimension matters -- a plan that is rigorous but infeasible scores near zero.
2. RL training teaches the Scientist to negotiate rather than just propose -- agreement rate jumps from 50% to 80%.
3. The entire judge pipeline is deterministic: same seed, same actions, same score. No LLM-as-judge variance.

---

## Hackathon Track Alignment

| Track | Fit |
|-------|-----|
| **Multi-Agent Interactions** | Two roles with private information negotiate toward consensus |
| **World Modeling (Professional)** | Agent reasons inside a professional world with hidden constraints |
| **Long-Horizon Planning** | Multi-round ask-revise-recover-converge cycle |
| **Self-Improvement** | Scientist measurably improves over repeated episodes |

---

## License

MIT