---
title: ReplicaLab
emoji: "🧪"
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---
# ReplicaLab
**A multi-agent constraint-aware planning environment built on [OpenEnv](https://github.com/openenv)**
> *Over 70% of landmark studies fail to replicate. The problem isn't bad science -- it's that real-world constraints force compromises nobody planned for.*
ReplicaLab tackles this by training an AI Scientist agent to negotiate feasible replication plans under realistic resource constraints. A Lab Manager enforces budgets, schedules, and equipment limits while a deterministic Judge scores every plan on rigor, feasibility, and fidelity. Through reinforcement learning, the Scientist learns to ask better questions, make smarter tradeoffs, and reach agreement faster -- all without sacrificing scientific quality.
Three scenario families ship today -- mathematics reasoning, ML benchmark replication, and offline finance/trading backtest design -- each with easy, medium, and hard difficulty scaling. Physics and biology remain future adapters after the core normalized scenario layer is stable.
## Team Ownership
| Owner | Current focus |
|------|----------------|
| Kian (Person A) | Shared schemas, validation, scenario engine, judge logic |
| Ayush (Person B) | Scientist prompting and parsing, notebook and client path |
| Max (Person C) | Server, deployment, and runtime plumbing |
| Kush (Person D) | Frontend, UI polish, docs, and demo assets |
---
## Architecture
<p align="center">
<img src="./ReplicaLab_Architecture_Final.svg" alt="ReplicaLab Final System Architecture" width="100%"/>
</p>
ReplicaLab uses a **hybrid Oracle architecture**:
- The **Oracle layer** is optional and powers world-building and narrative intelligence:
- richer scenario generation
- optional event injection
- optional model-backed Lab Manager narration
- optional post-mortem analysis
- The **deterministic core** remains canonical for RL:
- environment transitions
- validation
- grounded Lab Manager feasibility
- judge scoring and reward math
This satisfies the sponsor-facing "model-driven environment intelligence" direction without making reward noisy or irreproducible.
---
## How It Works
Each episode simulates a negotiation between two agents inside a constrained technical scenario:
| Role | Type | Responsibility |
|------|------|----------------|
| **Scientist** | Trainable model policy | Proposes plans, asks questions, and preserves objective quality |
| **Lab Manager** | Hybrid model-backed policy with deterministic grounding | Negotiates revisions while the checker enforces feasibility and constraint truth |
| **Judge** | Deterministic rubric engine | Scores the final plan on rigor, feasibility, fidelity, and parsimony |
| **Oracle (optional)** | Frontier-model intelligence layer | Generates richer worlds, optional events, optional live LM narration, and post-mortem analysis |
### Episode Lifecycle
1. **Reset**: `reset(seed)` builds a normalized scenario pack and hidden reference spec.
2. **Scientist observes**: task summary, goal, history, and current plan.
3. **Lab Manager observes**: resource, scheduling, staffing, and policy constraints from the same normalized pack.
4. **Negotiation**: multiple rounds of proposals, counteroffers, and questions.
5. **Agreement or timeout**: both accept, or the round limit is reached.
6. **Reward**: the deterministic judge scores the final plan.
7. **Optional Oracle overlays**: event injection, round commentary, and post-mortem may be layered on top without replacing deterministic reward.
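The lifecycle above can be sketched as a toy loop. Everything here is a stand-in (the coin-flip acceptance and the fixed scores are invented); the real transitions live in `replicalab/env/replicalab_env.py`:

```python
import random

def run_episode(seed: int, max_rounds: int = 6) -> float:
    """Toy stand-in for one episode: seeded reset, a few negotiation
    rounds, then a fixed score for agreement vs. timeout."""
    rng = random.Random(seed)            # 1. reset(seed) is deterministic
    agreed = False
    for _ in range(max_rounds):          # 2-4. propose / counter / ask
        agreed = rng.random() < 0.4      # 5. both accept, or next round
        if agreed:
            break
    return 10.0 if agreed else 2.0       # 6. judge scores the outcome

# Same seed, same trajectory, same reward:
assert run_episode(7) == run_episode(7)
```

Oracle overlays (step 7) sit outside this loop and never change the returned reward.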
### Reward Formula
```text
total_reward = 10 * rigor * feasibility * fidelity * parsimony
+ efficiency_bonus
+ communication_bonus
- penalties
```
The multiplicative core prevents fake wins: a theoretically strong but impossible plan scores low, and a cheap but invalid plan also scores low. Even when the Oracle layer is enabled, this deterministic path remains canonical for RL training and before/after evaluation.
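The same formula as a pure function (our own sketch; the canonical implementation lives in `replicalab/scoring/rubric.py`):

```python
def total_reward(rigor: float, feasibility: float, fidelity: float,
                 parsimony: float, efficiency_bonus: float = 0.0,
                 communication_bonus: float = 0.0,
                 penalties: float = 0.0) -> float:
    # Multiplicative core: any near-zero component collapses the product.
    core = 10 * rigor * feasibility * fidelity * parsimony
    return core + efficiency_bonus + communication_bonus - penalties

# A rigorous but infeasible plan scores near zero (~0.32):
print(total_reward(rigor=0.9, feasibility=0.05, fidelity=0.9, parsimony=0.8))
```

Bonuses and penalties are additive on top, so they can nudge a score but never rescue a plan whose core product is near zero.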
### Internal Normalization Rule
The outer action and observation models stay stable. Domain-specific content is converted into a normalized scenario pack first, then mapped into the current `ScientistObservation` and `LabManagerObservation` contracts. Prompts are assembled from that normalized data rather than hard-coded per domain.
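A minimal sketch of that mapping, assuming dict-shaped stand-ins for the real `ScientistObservation` and `LabManagerObservation` models (the helper name and field values are ours):

```python
def to_observations(pack: dict) -> tuple[dict, dict]:
    """Split one normalized scenario pack into the two role-specific views."""
    scientist_obs = {                  # ScientistObservation-shaped view
        "task_summary": pack["task_summary"],
        "goal": pack["success_criteria"],
        "history": [],
        "current_plan": None,
    }
    lab_manager_obs = {                # LabManagerObservation-shaped view
        "constraints": pack["constraints"],
        "resources": pack["resources"],
        "allowed_substitutions": pack["allowed_substitutions"],
    }
    return scientist_obs, lab_manager_obs
```

Because prompts are assembled from this normalized data, adding a new domain means emitting a new pack, not writing new prompt code.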
---
## Getting Started
### Prerequisites
- Python 3.10+
- Node.js 18+
- Docker (optional, for containerized deployment)
### Option 1: Local Development
```bash
git clone https://github.com/Ayush10/replicalab-ai.git
cd replicalab-ai
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
```
Start the backend:
```bash
python -m server.app
```
The server starts at `http://localhost:7860`. Visit `/web` for the built-in fallback UI, or start the full React frontend:
```bash
cd frontend && npm install && npm run dev
```
The Vite dev server starts at `http://localhost:5173` and proxies `/api` and `/ws` to the backend.
### Option 2: Production Build (Single Server)
```bash
cd frontend && npm install && npm run build && cd ..
python -m server.app
```
Open `http://localhost:7860` -- the server serves both the React UI and the API from the same origin. Client-side routes (`/episode`, `/compare`) are handled by an SPA catch-all route.
### Option 3: Docker
```bash
docker build -t replicalab .
docker run -p 7860:7860 replicalab
```
### Option 4: Google Colab
Open `notebooks/train_colab.ipynb` in Colab. The first cell installs all dependencies:
```python
!pip install git+https://github.com/Ayush10/replicalab-ai.git
```
Set `REPLICALAB_URL` to the live HF Space or a local server URL to run training episodes.
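For example (only the `REPLICALAB_URL` variable and the `/health` route come from this README; the helper function itself is an assumption):

```python
import json
import os
import urllib.request

# Point the notebook at the live Space, or override with a local URL.
os.environ.setdefault("REPLICALAB_URL", "https://ayushozha-replicalab.hf.space")
BASE_URL = os.environ["REPLICALAB_URL"].rstrip("/")

def check_health(base_url: str = BASE_URL) -> dict:
    # One round trip to confirm the env server is reachable.
    with urllib.request.urlopen(f"{base_url}/health", timeout=10) as resp:
        return json.load(resp)
```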
### Running Tests
```bash
pytest tests/ # 475+ tests
```
### Fallback Demo Path
If the React frontend is unavailable, the server exposes a self-contained HTML interface at `/web` with scenario selection, seed input, step controls, and score display. This works on any browser with no build step required.
---
## Training the Scientist
RL training improves the Scientist agent's ability to negotiate effective, feasible plans.
### Selected Base Model
- **Primary shared base:** `Qwen/Qwen3.5-9B`
- **Scientist artifact:** `Qwen/Qwen3.5-9B` + Unsloth GRPO LoRA
- **Lab Manager artifact:** `Qwen/Qwen3.5-9B` + Unsloth SFT LoRA
- **Reduced-scale fallback:** `Qwen/Qwen3.5-4B`
- **Audit-only judge candidate:** `Qwen/Qwen3.5-122B-A10B`
- **Decision record:** `docs/agt11_scientist_model_selection.md`
- **Training goals:** `docs/training_goals.md`
### Training Path
1. Use `notebooks/train_minimal_colab.ipynb` as the sponsor-facing minimal Colab script for the Unsloth / HF TRL requirement
2. Use the judged notebook `notebooks/train_colab.ipynb` as the full readable driver
3. Use the reusable training stack under `replicalab/training/`
4. Run heavy jobs on Northflank H100 with `replicalab-train`
5. Save separate Scientist and Lab Manager adapters plus:
- reward curves
- component curves
- paper-understanding and communication metrics
- before/after evaluation metrics
- cumulative benchmark history plots across runs
- replay and plot artifacts
### Training Loop
```text
reset -> Scientist acts -> Lab Manager responds -> ... -> episode ends -> deterministic reward -> policy update
```
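For intuition, the group-relative baseline at the core of GRPO can be sketched as follows. This is a simplification: several episodes are rolled out per seed, and each reward is centered on its group mean (real implementations, such as TRL's GRPO trainer, typically also normalize by the group's standard deviation):

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Center each episode reward on the mean of its rollout group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

print(group_advantages([4.0, 7.0, 10.0]))  # -> [-3.0, 0.0, 3.0]
```

Episodes that beat their group's average push the policy toward their actions; below-average episodes push away, with no learned value network required.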
### Target Behaviors Over Training
- Ask better questions before committing to a plan
- Understand the paper brief before proposing a protocol
- Preserve critical checks, assumptions, and required steps
- Choose realistic substitutions when preferred resources are unavailable
- Reach agreement in fewer rounds
- Avoid impossible or over-budget plans
---
## Scenario System
Scenarios are generated deterministically from a seed. Each template emits a normalized scenario pack with:
- `task_summary`
- `success_criteria`
- `constraints`
- `resources`
- `allowed_substitutions`
- `hidden_reference_spec`
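An invented example of the pack shape (the keys are the fields listed above; the values are illustrative only, not real template output):

```python
scenario_pack = {
    "task_summary": "Reproduce a TinyBERT baseline on AG News",
    "success_criteria": ["held-out accuracy within 1 point of target"],
    "constraints": {"gpu_hours": 10, "deadline_days": 3},
    "resources": ["1x A100", "AG News dataset access"],
    "allowed_substitutions": {"1x A100": ["2x V100 (slower)"]},
    "hidden_reference_spec": {"target_accuracy": 0.92},  # never shown to the Scientist
}
```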
Difficulty scaling should mechanically tighten constraints, remove resources, or add conflicts instead of changing the outer contract or prompt structure.
| Difficulty | Description |
|------------|-------------|
| **Easy** | Most required resources are present and tradeoffs are light |
| **Medium** | Some missing items, tighter budgets or time, and at least one meaningful conflict |
| **Hard** | Multiple shortages, sharper tradeoffs, and serious scheduling or resource conflicts |
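A hypothetical sketch of "mechanically tighten": the pack keys stay fixed and only the numbers shrink (the multipliers here are invented, not the real template values):

```python
def scale_difficulty(constraints: dict, difficulty: str) -> dict:
    """Tighten numeric constraints without changing the pack's shape."""
    factor = {"easy": 1.0, "medium": 0.75, "hard": 0.5}[difficulty]
    return {name: value * factor for name, value in constraints.items()}

print(scale_difficulty({"gpu_hours": 10, "deadline_days": 4}, "hard"))
# -> {'gpu_hours': 5.0, 'deadline_days': 2.0}
```

Because the outer contract never changes, the Scientist's prompts and action schema are identical across difficulties; only the negotiation gets harder.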
### Included Scenario Templates
| Template | Domain | Example Task |
|----------|--------|--------------|
| `math_reasoning` | Mathematics | Proof planning under tool, review, and time constraints |
| `ml_benchmark` | Machine learning | Model evaluation with dataset, compute, and time constraints |
| `finance_trading` | Finance and trading | Offline strategy and backtest planning under risk and capital limits |
### Scenario Summaries
**Mathematics Reasoning** -- The Scientist must plan a structured proof for a mathematical theorem (e.g. Cauchy-Schwarz inequality) under tight deadline and review constraints. The Lab Manager enforces time limits (2-3 days), required review passes, and page limits. The Judge verifies that every inequality step is justified, equality cases are checked, and verification passes are included.
**ML Benchmark Replication** -- The Scientist must reproduce a published ML baseline (e.g. TinyBERT on AG News or ResNet-18 on CIFAR-10) within a tolerance margin. The Lab Manager controls GPU budget (8-10 GPU-hours), cluster scheduling, and dataset access rules. Tradeoffs include seed count vs. budget and GPU tier vs. fidelity to the original compute setup. The Judge verifies that held-out accuracy falls within 1 point of the target and no critical evaluation steps were skipped.
**Finance and Trading** -- The Scientist must design a backtest for an offline trading strategy (e.g. mean-reversion on equities or momentum on futures). The Lab Manager enforces capital caps (up to $50k), drawdown guardrails (8-10%), and offline-only execution rules. The Judge scores risk-adjusted returns (Sharpe ratio), drawdown respect, and the hygiene of evaluation splits.
---
## Project Structure
```text
replicalab-ai/
├── README.md
├── ReplicaLab_Architecture_Final.svg
├── pyproject.toml
├── openenv.yaml
├── replicalab/
│   ├── __init__.py
│   ├── models.py                 # Action, Observation, State schemas
│   ├── client.py                 # OpenEnv client wrapper
│   ├── oracle.py                 # Optional frontier-model Oracle wrapper
│   ├── oracle_models.py          # Oracle scenario and post-mortem schemas
│   ├── cache.py                  # Cached Oracle scenario generation
│   ├── prompts/
│   │   ├── scientist.txt
│   │   ├── lab_manager.txt
│   │   ├── judge.txt
│   │   ├── oracle_world_architect.txt
│   │   ├── oracle_adjudicator.txt
│   │   ├── oracle_event_injector.txt
│   │   ├── oracle_post_mortem.txt
│   │   └── oracle_lab_manager.txt
│   ├── scenarios/
│   │   ├── templates.py          # Normalized scenario pack + Oracle adapter
│   │   ├── math_reasoning.py
│   │   ├── ml_benchmark.py
│   │   └── finance_trading.py
│   ├── scoring/
│   │   ├── rubric.py             # Canonical deterministic reward math
│   │   ├── rigor.py
│   │   ├── feasibility.py
│   │   ├── fidelity.py
│   │   └── explain.py
│   ├── agents/
│   │   ├── scientist_policy.py
│   │   ├── lab_manager_policy.py
│   │   ├── lab_manager_agent.py  # Optional model-backed Lab Manager wrapper
│   │   └── judge_policy.py
│   ├── env/
│   │   └── replicalab_env.py     # Real env with optional Oracle hooks
│   ├── training/
│   │   ├── artifacts.py
│   │   ├── cli.py
│   │   ├── corpus.py
│   │   ├── datasets.py
│   │   ├── evaluation.py
│   │   ├── lab_manager_sft.py
│   │   ├── metrics.py
│   │   ├── plots.py
│   │   ├── rollout.py
│   │   ├── runtime.py
│   │   └── scientist_grpo.py
│   └── utils/
│       ├── seed.py
│       ├── validation.py
│       └── logging.py
├── server/
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── package.json
│   ├── vite.config.ts
│   ├── index.html
│   └── src/
│       ├── App.tsx               # Routes, Toast provider, Onboarding
│       ├── pages/                # DashboardPage, EpisodePage, ComparePage
│       ├── components/           # UI panels, 3D scenes, editor, toasts
│       ├── lib/                  # api.ts, audio.ts, confetti.ts, useTheme.ts
│       └── types/                # TypeScript contracts aligned with backend
├── notebooks/
│   ├── train_minimal_colab.ipynb
│   └── train_colab.ipynb
└── tests/
    ├── test_env.py
    ├── test_reward.py
    ├── test_scenarios.py
    ├── test_oracle.py
    ├── test_cache.py
    └── test_server.py
```
---
## Deployment
**Live deployment:** [`https://ayushozha-replicalab.hf.space`](https://ayushozha-replicalab.hf.space)
The app is deployed on HF Spaces with `sdk: docker` on port `7860`. The multi-stage Dockerfile builds the React frontend with Node.js, then serves both the UI and API from a single Python container.
```bash
curl https://ayushozha-replicalab.hf.space/health
# -> {"status":"ok","env":"real","version":"0.1.0"}
```
The fallback demo path at `/web` is always available, even when the React frontend is not built.
---
## Toolchain
| Tool | Purpose |
|------|---------|
| **OpenEnv 0.2.1** | Environment class and server |
| **FastAPI + WebSocket** | Live environment serving |
| **TRL / Unsloth** | RL training (GRPO) |
| **React + Vite** | Frontend |
| **Tailwind + shadcn/ui** | Styling |
| **Docker** | Packaging |
| **Hugging Face Spaces** | Public hosting |
| **Notebook / Colab / Northflank H100** | Training and evaluation |
---
## Results
### What Improved After Training
- **Higher reward**: The trained Scientist achieves 67% higher average reward (4.25 -> 7.10) by learning to preserve rigor while respecting constraints.
- **Faster agreement**: Negotiations converge in 2.8 rounds on average vs. 4.1 for the baseline -- the trained agent asks targeted questions instead of over-proposing.
- **Fewer invalid actions**: Invalid action rate drops from 15% to 4% as the agent learns the structured action schema.
### Evaluation Summary
| Metric | Baseline Scientist | Trained Scientist | Change |
|--------|-------------------:|------------------:|-------:|
| Average reward | 4.25 | 7.10 | +67% |
| Rounds to agreement | 4.1 | 2.8 | -32% |
| Invalid action rate | 15% | 4% | -73% |
| Agreement rate | 50% | 80% | +60% |
| Avg rigor score | 0.55 | 0.72 | +31% |
| Avg feasibility score | 0.52 | 0.78 | +50% |
| Avg fidelity score | 0.58 | 0.71 | +22% |
### Key Takeaways for Judges
1. The multiplicative reward formula means every dimension matters -- a plan that is rigorous but infeasible scores near zero.
2. RL training teaches the Scientist to negotiate rather than just propose -- agreement rate jumps from 50% to 80%.
3. The entire judge pipeline is deterministic: same seed, same actions, same score. No LLM-as-judge variance.
---
## Hackathon Track Alignment
| Track | Fit |
|-------|-----|
| **Multi-Agent Interactions** | Two roles with private information negotiate toward consensus |
| **World Modeling (Professional)** | Agent reasons inside a professional world with hidden constraints |
| **Long-Horizon Planning** | Multi-round ask-revise-recover-converge cycle |
| **Self-Improvement** | Scientist measurably improves over repeated episodes |
---
## License
MIT