---
title: ReplicaLab
emoji: "🧪"
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---
# ReplicaLab
**A multi-agent constraint-aware planning environment built on [OpenEnv](https://github.com/openenv)**
> *Over 70% of landmark studies fail to replicate. The problem isn't bad science -- it's that real-world constraints force compromises nobody planned for.*
ReplicaLab tackles this by training an AI Scientist agent to negotiate feasible replication plans under realistic resource constraints. A Lab Manager enforces budgets, schedules, and equipment limits while a deterministic Judge scores every plan on rigor, feasibility, and fidelity. Through reinforcement learning, the Scientist learns to ask better questions, make smarter tradeoffs, and reach agreement faster -- all without sacrificing scientific quality.
Three scenario families ship today -- mathematics reasoning, ML benchmark replication, and offline finance/trading backtest design -- each with easy, medium, and hard difficulty scaling. Physics and biology adapters are planned once the core normalized scenario layer is stable.
## Team Ownership
| Owner | Current focus |
|------|----------------|
| Kian (Person A) | Shared schemas, validation, scenario engine, judge logic |
| Person B (Ayush) | Scientist prompting and parsing, notebook and client path |
| Max (Person C) | Server, deployment, and runtime plumbing |
| Kush (Person D) | Frontend, UI polish, docs, and demo assets |
---
## Architecture
ReplicaLab uses a **hybrid Oracle architecture**:
- The **Oracle layer** is optional and powers world-building and narrative intelligence:
  - richer scenario generation
  - optional event injection
  - optional model-backed Lab Manager narration
  - optional post-mortem analysis
- The **deterministic core** remains canonical for RL:
  - environment transitions
  - validation
  - grounded Lab Manager feasibility
  - judge scoring and reward math
This satisfies the sponsor-facing "model-driven environment intelligence" direction without making the reward signal noisy or irreproducible.
---
## How It Works
Each episode simulates a negotiation between two agents inside a constrained technical scenario:
| Role | Type | Responsibility |
|------|------|----------------|
| **Scientist** | Trainable model policy | Proposes plans, asks questions, and preserves objective quality |
| **Lab Manager** | Hybrid model-backed policy with deterministic grounding | Negotiates revisions while the checker enforces feasibility and constraint truth |
| **Judge** | Deterministic rubric engine | Scores the final plan on rigor, feasibility, fidelity, and parsimony |
| **Oracle (optional)** | Frontier-model intelligence layer | Generates richer worlds, optional events, optional live LM narration, and post-mortem analysis |
### Episode Lifecycle
1. **Reset**: `reset(seed)` builds a normalized scenario pack and hidden reference spec.
2. **Scientist observes**: task summary, goal, history, and current plan.
3. **Lab Manager observes**: resource, scheduling, staffing, and policy constraints from the same normalized pack.
4. **Negotiation**: multiple rounds of proposals, counteroffers, and questions.
5. **Agreement or timeout**: both accept, or the round limit is reached.
6. **Reward**: the deterministic judge scores the final plan.
7. **Optional Oracle overlays**: event injection, round commentary, and post-mortem may be layered on top without replacing deterministic reward.
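The lifecycle above can be caricatured as a tiny driver loop. Everything below is a toy sketch: the function names, numbers, and scoring rule are invented for illustration and are not the actual ReplicaLab client API.

```python
# Toy sketch of the lifecycle above: reset from a seed, negotiate for a
# bounded number of rounds, then score deterministically. All names and
# numbers here are invented for illustration, not the real client API.
import random

MAX_ROUNDS = 6

def judge_score(cost: int, budget: int) -> float:
    # Deterministic "judge": the same inputs always produce the same score.
    return round(10 * cost / budget, 2)

def run_episode(seed: int) -> float:
    rng = random.Random(seed)       # reset(seed): everything flows from the seed
    budget = rng.randint(5, 10)     # hidden constraint from the scenario pack
    plan_cost = rng.randint(8, 12)  # Scientist's opening proposal

    for _ in range(MAX_ROUNDS):     # negotiation rounds
        if plan_cost <= budget:     # Lab Manager accepts a feasible plan
            return judge_score(plan_cost, budget)
        plan_cost -= 1              # Scientist concedes and revises
    return 0.0                      # round limit reached: timeout, no reward

print(run_episode(seed=42))  # same seed always prints the same score
```

The key property the sketch demonstrates is determinism: identical seeds replay identical episodes, which is what makes before/after evaluation meaningful.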
### Reward Formula
```text
total_reward = 10 * rigor * feasibility * fidelity * parsimony
+ efficiency_bonus
+ communication_bonus
- penalties
```
The multiplicative core prevents fake wins: a theoretically strong but impossible plan scores low, and a cheap but invalid plan also scores low. Even when the Oracle layer is enabled, this deterministic path remains canonical for RL training and before/after evaluation.
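Read as code, the formula above is a single function. This sketch mirrors the formula term by term; the example component values are invented for illustration and this is not the repository's actual rubric implementation.

```python
def total_reward(rigor, feasibility, fidelity, parsimony,
                 efficiency_bonus=0.0, communication_bonus=0.0,
                 penalties=0.0):
    """Deterministic reward sketch mirroring the formula above.

    Each quality component is expected in [0, 1]; the multiplicative
    core means any single zero component zeroes the whole core term.
    """
    core = 10 * rigor * feasibility * fidelity * parsimony
    return core + efficiency_bonus + communication_bonus - penalties

# A rigorous but infeasible plan collapses toward zero:
print(total_reward(rigor=0.9, feasibility=0.0, fidelity=0.9, parsimony=0.9))  # 0.0
# A balanced plan keeps a meaningful core plus any bonuses:
print(total_reward(0.8, 0.7, 0.8, 0.9, efficiency_bonus=0.5))
```

Because the bonuses are additive and the quality terms multiplicative, no amount of efficiency or communication bonus can rescue a plan whose core is zero in any dimension.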
### Internal Normalization Rule
The outer action and observation models stay stable. Domain-specific content is converted into a normalized scenario pack first, then mapped into the current `ScientistObservation` and `LabManagerObservation` contracts. Prompts are assembled from that normalized data rather than hard-coded per domain.
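A minimal sketch of that rule follows. The pack fields mirror the normalized scenario pack described in the Scenario System section, but the observation fields shown here are illustrative assumptions, not the exact `ScientistObservation` / `LabManagerObservation` contracts.

```python
# Sketch of the normalization rule above: one normalized pack, two
# role-specific views derived from it. Field names on the views are
# assumptions for illustration, not the real observation contracts.
from dataclasses import dataclass

@dataclass
class ScenarioPack:
    task_summary: str
    success_criteria: list
    constraints: dict
    resources: dict
    hidden_reference_spec: dict  # never exposed to either agent directly

def scientist_view(pack: ScenarioPack) -> dict:
    # The Scientist sees the goal, not the raw constraint truth.
    return {"task_summary": pack.task_summary,
            "success_criteria": pack.success_criteria}

def lab_manager_view(pack: ScenarioPack) -> dict:
    # The Lab Manager sees constraints and resources from the SAME pack,
    # so both observations stay consistent by construction.
    return {"constraints": pack.constraints, "resources": pack.resources}
```

Deriving both views from one pack is what lets new domains plug in without touching the outer contracts: only the pack builder changes per domain.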
---
## Getting Started
### Prerequisites
- Python 3.10+
- Node.js 18+
- Docker (optional, for containerized deployment)
### Option 1: Local Development
```bash
git clone https://github.com/Ayush10/replicalab-ai.git
cd replicalab-ai
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
```
Start the backend:
```bash
python -m server.app
```
The server starts at `http://localhost:7860`. Visit `/web` for the built-in fallback UI, or start the full React frontend:
```bash
cd frontend && npm install && npm run dev
```
The Vite dev server starts at `http://localhost:5173` and proxies `/api` and `/ws` to the backend.
### Option 2: Production Build (Single Server)
```bash
cd frontend && npm install && npm run build && cd ..
python -m server.app
```
Open `http://localhost:7860` -- the server serves both the React UI and API from the same origin. Client-side routes (`/episode`, `/compare`) are handled by an SPA catch-all route.
### Option 3: Docker
```bash
docker build -t replicalab .
docker run -p 7860:7860 replicalab
```
### Option 4: Google Colab
Open `notebooks/train_colab.ipynb` in Colab. The first cell installs all dependencies:
```python
!pip install git+https://github.com/Ayush10/replicalab-ai.git
```
Set `REPLICALAB_URL` to the live HF Space or a local server URL to run training episodes.
### Running Tests
```bash
pytest tests/ # 475+ tests
```
### Fallback Demo Path
If the React frontend is unavailable, the server exposes a self-contained HTML interface at `/web` with scenario selection, seed input, step controls, and score display. This works on any browser with no build step required.
---
## Training the Scientist
RL training improves the Scientist agent's ability to negotiate effective, feasible plans.
### Selected Base Model
- **Primary shared base:** `Qwen/Qwen3.5-9B`
- **Scientist artifact:** `Qwen/Qwen3.5-9B` + Unsloth GRPO LoRA
- **Lab Manager artifact:** `Qwen/Qwen3.5-9B` + Unsloth SFT LoRA
- **Reduced-scale fallback:** `Qwen/Qwen3.5-4B`
- **Audit-only judge candidate:** `Qwen/Qwen3.5-122B-A10B`
- **Decision record:** `docs/agt11_scientist_model_selection.md`
- **Training goals:** `docs/training_goals.md`
### Training Path
1. Use `notebooks/train_minimal_colab.ipynb` as the sponsor-facing minimal Colab script for the Unsloth / HF TRL requirement
2. Use the judged notebook `notebooks/train_colab.ipynb` as the full readable driver
3. Use the reusable training stack under `replicalab/training/`
4. Run heavy jobs on Northflank H100 with `replicalab-train`
5. Save separate Scientist and Lab Manager adapters plus:
- reward curves
- component curves
- paper-understanding and communication metrics
- before/after evaluation metrics
- cumulative benchmark history plots across runs
- replay and plot artifacts
### Training Loop
```text
reset -> Scientist acts -> Lab Manager responds -> ... -> episode ends -> deterministic reward -> policy update
```
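Schematically, the loop collects judged rollouts and folds them into a policy update. The sketch below shows only that control flow; the `rollout` and `policy_update` stubs are placeholders, not real TRL or Unsloth APIs.

```python
# Schematic of the training loop above: collect rollouts, score them with
# the deterministic judge, then update the policy. The rollout and update
# functions are stubs for illustration, not real TRL/Unsloth calls.
def train(num_iterations: int, episodes_per_batch: int = 8) -> dict:
    policy = {"version": 0}                      # stands in for model weights
    for it in range(num_iterations):
        batch = []
        for ep in range(episodes_per_batch):
            seed = it * episodes_per_batch + ep  # unique, reproducible seed
            transcript = rollout(policy, seed)
            batch.append((transcript, deterministic_reward(transcript)))
        policy = policy_update(policy, batch)    # e.g. one GRPO step
    return policy

def rollout(policy, seed):
    # Stub: a real rollout would step the env with the policy's actions.
    return [("scientist", "propose"), ("lab_manager", "accept")]

def deterministic_reward(transcript):
    # Stub for the judge: agreement earns reward, timeout earns none.
    return 1.0 if ("lab_manager", "accept") in transcript else 0.0

def policy_update(policy, batch):
    # Stub: a real step would backpropagate through the reward-weighted batch.
    return {"version": policy["version"] + 1}
```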
### Target Behaviors Over Training
- Ask better questions before committing to a plan
- Understand the paper brief before proposing a protocol
- Preserve critical checks, assumptions, and required steps
- Choose realistic substitutions when preferred resources are unavailable
- Reach agreement in fewer rounds
- Avoid impossible or over-budget plans
---
## Scenario System
Scenarios are generated deterministically from a seed. Each template emits a normalized scenario pack with:
- `task_summary`
- `success_criteria`
- `constraints`
- `resources`
- `allowed_substitutions`
- `hidden_reference_spec`
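Because generation is seeded, the same seed always yields byte-identical packs. A minimal sketch of that property, with field values invented for illustration:

```python
# Toy seed-deterministic pack builder using the field names listed above.
# All concrete values are invented for illustration.
import random

def build_pack(seed: int) -> dict:
    rng = random.Random(seed)  # all randomness flows from the seed
    return {
        "task_summary": f"Replicate baseline #{rng.randint(1, 100)}",
        "constraints": {"gpu_hours": rng.randint(8, 10)},
        "resources": {"gpu": rng.choice(["A100", "V100", "T4"])},
        "allowed_substitutions": {"V100": "T4"},
    }

# Same seed -> identical pack, every time.
assert build_pack(7) == build_pack(7)
```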
Difficulty scaling should mechanically tighten constraints, remove resources, or add conflicts instead of changing the outer contract or prompt structure.
| Difficulty | Description |
|------------|-------------|
| **Easy** | Most required resources are present and tradeoffs are light |
| **Medium** | Some missing items, tighter budgets or time, and at least one meaningful conflict |
| **Hard** | Multiple shortages, sharper tradeoffs, and serious scheduling or resource conflicts |
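Mechanical tightening might look like the following sketch: the same pack structure, with constraints scaled down and a resource removed, while the outer contract is untouched. The scale factors and field names here are invented for illustration.

```python
# Illustrative difficulty scaling: tighten constraints, remove a resource
# on hard, never change the pack's shape. Numbers are invented.
TIGHTEN = {"easy": 1.0, "medium": 0.7, "hard": 0.5}

def apply_difficulty(pack: dict, difficulty: str) -> dict:
    scale = TIGHTEN[difficulty]
    scaled = dict(pack)
    scaled["constraints"] = {k: int(v * scale)
                             for k, v in pack["constraints"].items()}
    if difficulty == "hard":
        # Hard also removes a resource to force a substitution tradeoff.
        scaled["resources"] = dict(pack["resources"])
        scaled["resources"].pop("preferred_gpu", None)
    return scaled

base = {"constraints": {"gpu_hours": 10, "days": 4},
        "resources": {"preferred_gpu": "A100", "fallback_gpu": "T4"}}
print(apply_difficulty(base, "hard"))
# -> budgets halved, preferred GPU removed, same pack shape
```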
### Included Scenario Templates
| Template | Domain | Example Task |
|----------|--------|--------------|
| `math_reasoning` | Mathematics | Proof planning under tool, review, and time constraints |
| `ml_benchmark` | Machine learning | Model evaluation with dataset, compute, and time constraints |
| `finance_trading` | Finance and trading | Offline strategy and backtest planning under risk and capital limits |
### Scenario Summaries
**Mathematics Reasoning** -- The Scientist must plan a structured proof for a mathematical theorem (e.g. Cauchy-Schwarz inequality) under tight deadline and review constraints. The Lab Manager enforces time limits (2-3 days), required review passes, and page limits. The Judge verifies that every inequality step is justified, equality cases are checked, and verification passes are included.
**ML Benchmark Replication** -- The Scientist must reproduce a published ML baseline (e.g. TinyBERT on AG News or ResNet-18 on CIFAR-10) within a tolerance margin. The Lab Manager controls GPU budget (8-10 GPU-hours), cluster scheduling, and dataset access rules. Tradeoffs include seed count vs. budget and GPU tier vs. fidelity to the original compute setup. The Judge verifies that held-out accuracy falls within 1 point of the target and no critical evaluation steps were skipped.
**Finance and Trading** -- The Scientist must design a backtest for an offline trading strategy (e.g. mean-reversion on equities or momentum on futures). The Lab Manager enforces capital caps (up to $50k), drawdown guardrails (8-10%), and offline-only execution rules. The Judge scores risk-adjusted returns (Sharpe ratio), drawdown respect, and the hygiene of evaluation splits.
---
## Project Structure
```text
replicalab-ai/
├── README.md
├── ReplicaLab_Architecture_Final.svg
├── pyproject.toml
├── openenv.yaml
├── replicalab/
│   ├── __init__.py
│   ├── models.py                 # Action, Observation, State schemas
│   ├── client.py                 # OpenEnv client wrapper
│   ├── oracle.py                 # Optional frontier-model Oracle wrapper
│   ├── oracle_models.py          # Oracle scenario and post-mortem schemas
│   ├── cache.py                  # Cached Oracle scenario generation
│   ├── prompts/
│   │   ├── scientist.txt
│   │   ├── lab_manager.txt
│   │   ├── judge.txt
│   │   ├── oracle_world_architect.txt
│   │   ├── oracle_adjudicator.txt
│   │   ├── oracle_event_injector.txt
│   │   ├── oracle_post_mortem.txt
│   │   └── oracle_lab_manager.txt
│   ├── scenarios/
│   │   ├── templates.py          # Normalized scenario pack + Oracle adapter
│   │   ├── math_reasoning.py
│   │   ├── ml_benchmark.py
│   │   └── finance_trading.py
│   ├── scoring/
│   │   ├── rubric.py             # Canonical deterministic reward math
│   │   ├── rigor.py
│   │   ├── feasibility.py
│   │   ├── fidelity.py
│   │   └── explain.py
│   ├── agents/
│   │   ├── scientist_policy.py
│   │   ├── lab_manager_policy.py
│   │   ├── lab_manager_agent.py  # Optional model-backed Lab Manager wrapper
│   │   └── judge_policy.py
│   ├── env/
│   │   └── replicalab_env.py     # Real env with optional Oracle hooks
│   ├── training/
│   │   ├── artifacts.py
│   │   ├── cli.py
│   │   ├── corpus.py
│   │   ├── datasets.py
│   │   ├── evaluation.py
│   │   ├── lab_manager_sft.py
│   │   ├── metrics.py
│   │   ├── plots.py
│   │   ├── rollout.py
│   │   ├── runtime.py
│   │   └── scientist_grpo.py
│   └── utils/
│       ├── seed.py
│       ├── validation.py
│       └── logging.py
├── server/
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── package.json
│   ├── vite.config.ts
│   ├── index.html
│   └── src/
│       ├── App.tsx               # Routes, Toast provider, Onboarding
│       ├── pages/                # DashboardPage, EpisodePage, ComparePage
│       ├── components/           # UI panels, 3D scenes, editor, toasts
│       ├── lib/                  # api.ts, audio.ts, confetti.ts, useTheme.ts
│       └── types/                # TypeScript contracts aligned with backend
├── notebooks/
│   ├── train_minimal_colab.ipynb
│   └── train_colab.ipynb
└── tests/
    ├── test_env.py
    ├── test_reward.py
    ├── test_scenarios.py
    ├── test_oracle.py
    ├── test_cache.py
    └── test_server.py
```
---
## Deployment
**Live deployment:** [`https://ayushozha-replicalab.hf.space`](https://ayushozha-replicalab.hf.space)
The app is deployed on HF Spaces with `sdk: docker` on port `7860`. The multi-stage Dockerfile builds the React frontend with Node.js, then serves both the UI and API from a single Python container.
```bash
curl https://ayushozha-replicalab.hf.space/health
# -> {"status":"ok","env":"real","version":"0.1.0"}
```
The fallback demo path at `/web` is always available, even when the React frontend is not built.
---
## Toolchain
| Tool | Purpose |
|------|---------|
| **OpenEnv 0.2.1** | Environment class and server |
| **FastAPI + WebSocket** | Live environment serving |
| **TRL / Unsloth** | RL training (GRPO) |
| **React + Vite** | Frontend |
| **Tailwind + shadcn/ui** | Styling |
| **Docker** | Packaging |
| **Hugging Face Spaces** | Public hosting |
| **Notebook / Colab / Northflank H100** | Training and evaluation |
---
## Results
### What Improved After Training
- **Higher reward**: The trained Scientist achieves 67% higher average reward (4.25 -> 7.10) by learning to preserve rigor while respecting constraints.
- **Faster agreement**: Negotiations converge in 2.8 rounds on average vs. 4.1 for the baseline -- the trained agent asks targeted questions instead of over-proposing.
- **Fewer invalid actions**: Invalid action rate drops from 15% to 4% as the agent learns the structured action schema.
### Evaluation Summary
| Metric | Baseline Scientist | Trained Scientist | Change |
|--------|-------------------:|------------------:|-------:|
| Average reward | 4.25 | 7.10 | +67% |
| Rounds to agreement | 4.1 | 2.8 | -32% |
| Invalid action rate | 15% | 4% | -73% |
| Agreement rate | 50% | 80% | +60% |
| Avg rigor score | 0.55 | 0.72 | +31% |
| Avg feasibility score | 0.52 | 0.78 | +50% |
| Avg fidelity score | 0.58 | 0.71 | +22% |
### Key Takeaways for Judges
1. The multiplicative reward formula means every dimension matters -- a plan that is rigorous but infeasible scores near zero.
2. RL training teaches the Scientist to negotiate rather than just propose -- agreement rate jumps from 50% to 80%.
3. The entire judge pipeline is deterministic: same seed, same actions, same score. No LLM-as-judge variance.
---
## Hackathon Track Alignment
| Track | Fit |
|-------|-----|
| **Multi-Agent Interactions** | Two roles with private information negotiate toward consensus |
| **World Modeling (Professional)** | Agent reasons inside a professional world with hidden constraints |
| **Long-Horizon Planning** | Multi-round ask-revise-recover-converge cycle |
| **Self-Improvement** | Scientist measurably improves over repeated episodes |
---
## License
MIT