---
title: ReplicaLab
emoji: "🧪"
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---
# ReplicaLab

**A multi-agent, constraint-aware planning environment built on [OpenEnv](https://github.com/openenv)**

> *Over 70% of landmark studies fail to replicate. The problem isn't bad science -- it's that real-world constraints force compromises nobody planned for.*

ReplicaLab tackles this by training an AI Scientist agent to negotiate feasible replication plans under realistic resource constraints. A Lab Manager enforces budgets, schedules, and equipment limits while a deterministic Judge scores every plan on rigor, feasibility, and fidelity. Through reinforcement learning, the Scientist learns to ask better questions, make smarter tradeoffs, and reach agreement faster -- all without sacrificing scientific quality.

Three scenario families ship today -- mathematics reasoning, ML benchmark replication, and offline finance/trading backtest design -- each with easy, medium, and hard difficulty levels. Physics and biology adapters are planned once the core normalized scenario layer is stable.
## Team Ownership

| Owner | Current focus |
|-------|---------------|
| Kian (Person A) | Shared schemas, validation, scenario engine, judge logic |
| Ayush (Person B) | Scientist prompting and parsing, notebook and client path |
| Max (Person C) | Server, deployment, and runtime plumbing |
| Kush (Person D) | Frontend, UI polish, docs, and demo assets |

---
## Architecture

<p align="center">
  <img src="./ReplicaLab_Architecture_Final.svg" alt="ReplicaLab Final System Architecture" width="100%"/>
</p>

ReplicaLab uses a **hybrid Oracle architecture**:

- The **Oracle layer** is optional and powers world-building and narrative intelligence:
  - richer scenario generation
  - optional event injection
  - optional model-backed Lab Manager narration
  - optional post-mortem analysis
- The **deterministic core** remains canonical for RL:
  - environment transitions
  - validation
  - grounded Lab Manager feasibility
  - judge scoring and reward math

This satisfies the sponsor-facing “model-driven environment intelligence” direction without making reward noisy or irreproducible.

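To make the split concrete, here is a minimal sketch of how a single Lab Manager turn might keep the two layers apart, assuming a plan can be reduced to a cost and a duration. `LabManagerReply`, `check_feasibility`, and `lab_manager_reply` are hypothetical names, not ReplicaLab's API. The deterministic checker alone decides feasibility; the optional `narrator` callable (standing in for the Oracle) only changes how the reply is worded.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical names for illustration only; not the ReplicaLab API.
Narrator = Callable[[tuple[str, ...]], str]


@dataclass(frozen=True)
class LabManagerReply:
    feasible: bool               # always decided by the deterministic checker
    violations: tuple[str, ...]  # grounded constraint violations
    message: str                 # wording only; may come from an optional Oracle


def check_feasibility(cost: float, budget: float, days: int, max_days: int) -> tuple[str, ...]:
    """Deterministic constraint check: the same inputs always give the same violations."""
    violations = []
    if cost > budget:
        violations.append(f"cost {cost} exceeds budget {budget}")
    if days > max_days:
        violations.append(f"duration {days}d exceeds limit {max_days}d")
    return tuple(violations)


def lab_manager_reply(
    cost: float,
    budget: float,
    days: int,
    max_days: int,
    narrator: Optional[Narrator] = None,
) -> LabManagerReply:
    violations = check_feasibility(cost, budget, days, max_days)
    if narrator is None:
        # Deterministic template used when the Oracle layer is disabled.
        message = "Plan accepted." if not violations else "Please revise: " + "; ".join(violations)
    else:
        # The Oracle may phrase the reply, but it never changes the verdict.
        message = narrator(violations)
    return LabManagerReply(feasible=not violations, violations=violations, message=message)


plain = lab_manager_reply(cost=12.0, budget=10.0, days=3, max_days=3)
narrated = lab_manager_reply(12.0, 10.0, 3, 3, narrator=lambda v: f"We have {len(v)} blocker(s).")
print(plain)
print(narrated)
```

Both replies carry the same `feasible` flag and `violations`; only `message` differs, which is why enabling the Oracle cannot perturb anything the reward depends on.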

---
## How It Works

Each episode simulates a negotiation between two agents, the Scientist and the Lab Manager, inside a constrained technical scenario. A deterministic Judge scores the outcome, and an optional Oracle can enrich the world around them:

| Role | Type | Responsibility |
|------|------|----------------|
| **Scientist** | Trainable model policy | Proposes plans, asks questions, and preserves objective quality |
| **Lab Manager** | Hybrid model-backed policy with deterministic grounding | Negotiates revisions while a deterministic checker enforces feasibility and the true constraints |
| **Judge** | Deterministic rubric engine | Scores the final plan on rigor, feasibility, fidelity, and parsimony |
| **Oracle (optional)** | Frontier-model intelligence layer | Generates richer worlds, optional events, optional live LM narration, and post-mortem analysis |

### Episode Lifecycle

1. **Reset**: `reset(seed)` builds a normalized scenario pack and a hidden reference spec.
2. **Scientist observes**: the task summary, goal, history, and current plan.
3. **Lab Manager observes**: resource, scheduling, staffing, and policy constraints from the same normalized pack.
4. **Negotiation**: multiple rounds of proposals, counteroffers, and questions.
5. **Agreement or timeout**: both agents accept, or the round limit is reached.
6. **Reward**: the deterministic Judge scores the final plan.
7. **Optional Oracle overlays**: event injection, round commentary, and post-mortem analysis may be layered on top without replacing the deterministic reward.
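The lifecycle reduces to a simple control loop: alternate proposals and reviews until agreement or the round limit, then score the final plan exactly once. The sketch below is a toy under stated assumptions: a "plan" is just its cost, the Scientist implicitly accepts its own proposal (so agreement means the Lab Manager accepts), and `run_episode`, `propose`, `review`, `judge`, and the default of six rounds are all hypothetical rather than the real environment's interface.

```python
from typing import Callable, Optional

# Toy stand-ins (hypothetical): a plan is just its cost in budget units.
Plan = float
Propose = Callable[[Optional[Plan], list], Plan]
Review = Callable[[Plan], tuple[bool, str]]
Judge = Callable[[Plan], float]


def run_episode(propose: Propose, review: Review, judge: Judge, max_rounds: int = 6) -> dict:
    """Negotiate until agreement or timeout, then score the final plan once."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    history: list[tuple[int, Plan, str]] = []
    plan: Optional[Plan] = None
    for round_idx in range(1, max_rounds + 1):
        plan = propose(plan, history)      # Scientist proposes (and implicitly accepts) a plan
        accepted, feedback = review(plan)  # Lab Manager accepts or pushes back
        history.append((round_idx, plan, feedback))
        if accepted:                       # both sides accept: agreement ends the episode
            return {"agreed": True, "rounds": round_idx, "reward": judge(plan)}
    # Timeout: the judge still scores whatever plan is on the table.
    return {"agreed": False, "rounds": max_rounds, "reward": judge(plan)}


BUDGET = 8.0


def propose(previous: Optional[Plan], history: list) -> Plan:
    return 12.0 if previous is None else previous - 2.0  # trim cost each round


def review(plan: Plan) -> tuple[bool, str]:
    return (plan <= BUDGET, "ok" if plan <= BUDGET else "over budget")


def judge(plan: Plan) -> float:
    return 10.0 if plan <= BUDGET else 0.0


print(run_episode(propose, review, judge))                # agrees in round 3
print(run_episode(propose, review, judge, max_rounds=2))  # times out
```

The toy Scientist proposes 12, 10, then 8; only the third proposal fits the budget, so the first run agrees in round 3, while the two-round run times out holding an over-budget plan and scores zero.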
### Reward Formula

```text
total_reward = 10 * rigor * feasibility * fidelity * parsimony
             + efficiency_bonus
             + communication_bonus
             - penalties
```

The multiplicative core prevents fake wins: a theoretically strong but impossible plan scores low, and a cheap but invalid plan also scores low. Even when the Oracle layer is enabled, this deterministic path remains canonical for RL training and before/after evaluation.
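A small Python transcription of the formula shows why the product matters. It assumes each core component is a score in [0, 1] (consistent with the component averages in the Results section); `total_reward` is a hypothetical helper, and the canonical implementation lives in `replicalab/scoring/rubric.py`.

```python
# Sketch of the reward formula above. Assumption: each core component is a score in [0, 1].
def total_reward(
    rigor: float,
    feasibility: float,
    fidelity: float,
    parsimony: float,
    efficiency_bonus: float = 0.0,
    communication_bonus: float = 0.0,
    penalties: float = 0.0,
) -> float:
    components = {"rigor": rigor, "feasibility": feasibility, "fidelity": fidelity, "parsimony": parsimony}
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    core = 10 * rigor * feasibility * fidelity * parsimony  # one weak dimension drags the product down
    return core + efficiency_bonus + communication_bonus - penalties


balanced = total_reward(0.8, 0.8, 0.8, 0.8)         # 10 * 0.8**4, roughly 4.1
impossible = total_reward(0.95, 0.05, 0.9, 0.9)     # strong on paper but infeasible: roughly 0.38
additive_view = 10 * (0.95 + 0.05 + 0.9 + 0.9) / 4  # an averaged score would still give about 7.0
print(balanced, impossible, additive_view)
```

Under an average, the infeasible plan would outscore the balanced one; under the product it falls to roughly a tenth of it.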
### Internal Normalization Rule

The outer action and observation models stay stable. Domain-specific content is converted into a normalized scenario pack first, then mapped into the current `ScientistObservation` and `LabManagerObservation` contracts. Prompts are assembled from that normalized data rather than hard-coded per domain.

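A minimal sketch of the normalization rule, using the pack fields listed under Scenario System. `ScientistView` and `LabManagerView` are hypothetical, simplified stand-ins for the real `ScientistObservation` and `LabManagerObservation` models, and `to_views` and `render_prompt` are illustrative helpers. The point is that one generic mapping and one generic prompt builder serve every domain, and the hidden reference spec never reaches either role.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScenarioPack:
    """Normalized, domain-agnostic pack (fields as listed in the Scenario System section)."""
    task_summary: str
    success_criteria: tuple[str, ...]
    constraints: dict[str, float]
    resources: tuple[str, ...]
    allowed_substitutions: dict[str, str]
    hidden_reference_spec: dict[str, object]  # judge-only; never rendered into a prompt


@dataclass(frozen=True)
class ScientistView:  # hypothetical stand-in for ScientistObservation
    task_summary: str
    success_criteria: tuple[str, ...]


@dataclass(frozen=True)
class LabManagerView:  # hypothetical stand-in for LabManagerObservation
    constraints: dict[str, float]
    resources: tuple[str, ...]
    allowed_substitutions: dict[str, str]


def to_views(pack: ScenarioPack) -> tuple[ScientistView, LabManagerView]:
    """Map one normalized pack onto both role views; the hidden spec goes to neither."""
    return (
        ScientistView(pack.task_summary, pack.success_criteria),
        LabManagerView(pack.constraints, pack.resources, pack.allowed_substitutions),
    )


def render_prompt(role: str, view: object) -> str:
    """Generic prompt assembly: the same code works for every domain."""
    lines = [f"You are the {role}."]
    lines += [f"{key}: {value}" for key, value in asdict(view).items()]
    return "\n".join(lines)


# Illustrative toy pack, not taken from the shipped templates.
pack = ScenarioPack(
    task_summary="Reproduce a published image-classification baseline",
    success_criteria=("held-out accuracy within 1 point of target",),
    constraints={"gpu_hours": 8.0},
    resources=("shared_gpu_cluster",),
    allowed_substitutions={"A100": "V100"},
    hidden_reference_spec={"required_steps": ("held-out evaluation",)},
)
scientist_view, manager_view = to_views(pack)
print(render_prompt("Scientist", scientist_view))
print(render_prompt("Lab Manager", manager_view))
```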

---
## Getting Started

### Prerequisites

- Python 3.10+
- Node.js 18+
- Docker (optional, for containerized deployment)

### Option 1: Local Development

```bash
git clone https://github.com/Ayush10/replicalab-ai.git
cd replicalab-ai
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
```

Start the backend:

```bash
python -m server.app
```

The server starts at `http://localhost:7860`. Visit `/web` for the built-in fallback UI, or start the full React frontend:

```bash
cd frontend && npm install && npm run dev
```

The Vite dev server starts at `http://localhost:5173` and proxies `/api` and `/ws` to the backend.
### Option 2: Production Build (Single Server)

```bash
cd frontend && npm install && npm run build && cd ..
python -m server.app
```

Open `http://localhost:7860` -- the server serves both the React UI and the API from the same origin. Client-side routes (`/episode`, `/compare`) are handled by an SPA catch-all route.

### Option 3: Docker

```bash
docker build -t replicalab .
docker run -p 7860:7860 replicalab
```

### Option 4: Google Colab

Open `notebooks/train_colab.ipynb` in Colab. The first cell installs all dependencies:

```python
!pip install git+https://github.com/Ayush10/replicalab-ai.git
```

Set `REPLICALAB_URL` to the live HF Space URL or a local server URL to run training episodes.
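A client has to turn `REPLICALAB_URL` into concrete endpoints. The sketch below shows one way to do that, assuming a fallback to the local server when the variable is unset and assuming the backend WebSocket lives at `/ws` (the path the Vite dev server proxies). `resolve_base_url`, `health_url`, and `websocket_url` are hypothetical helpers, not functions shipped in `replicalab/client.py`.

```python
import os
from collections.abc import Mapping
from typing import Optional
from urllib.parse import urlparse

DEFAULT_URL = "http://localhost:7860"  # assumption: fall back to a local server


def resolve_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Read REPLICALAB_URL (or the local default), strip a trailing slash, and validate it."""
    env = os.environ if env is None else env
    url = env.get("REPLICALAB_URL", DEFAULT_URL).strip().rstrip("/")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"REPLICALAB_URL must be an http(s) URL, got {url!r}")
    return url


def health_url(base_url: str) -> str:
    return f"{base_url}/health"


def websocket_url(base_url: str) -> str:
    """Assumption: the WebSocket endpoint is /ws; https maps to wss, http to ws."""
    parsed = urlparse(base_url)
    scheme = "wss" if parsed.scheme == "https" else "ws"
    return parsed._replace(scheme=scheme, path=parsed.path.rstrip("/") + "/ws").geturl()


base = resolve_base_url({"REPLICALAB_URL": "https://ayushozha-replicalab.hf.space/"})
print(health_url(base))     # https://ayushozha-replicalab.hf.space/health
print(websocket_url(base))  # wss://ayushozha-replicalab.hf.space/ws
```

Validating the URL up front turns a common Colab mistake, such as setting `localhost:7860` without a scheme, into an immediate error instead of a confusing connection failure mid-training.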
### Running Tests

```bash
pytest tests/  # 475+ tests
```

### Fallback Demo Path

If the React frontend is unavailable, the server exposes a self-contained HTML interface at `/web` with scenario selection, seed input, step controls, and score display. It works in any browser and needs no build step.

---
## Training the Scientist

RL training improves the Scientist agent's ability to negotiate effective, feasible plans.

### Selected Base Model

- **Primary shared base:** `Qwen/Qwen3.5-9B`
- **Scientist artifact:** `Qwen/Qwen3.5-9B` + Unsloth GRPO LoRA
- **Lab Manager artifact:** `Qwen/Qwen3.5-9B` + Unsloth SFT LoRA
- **Reduced-scale fallback:** `Qwen/Qwen3.5-4B`
- **Audit-only judge candidate:** `Qwen/Qwen3.5-122B-A10B`
- **Decision record:** `docs/agt11_scientist_model_selection.md`
- **Training goals:** `docs/training_goals.md`
### Training Path

1. Use `notebooks/train_minimal_colab.ipynb` as the sponsor-facing minimal Colab script for the Unsloth / HF TRL requirement.
2. Use the judged notebook `notebooks/train_colab.ipynb` as the full, readable driver.
3. Use the reusable training stack under `replicalab/training/`.
4. Run heavy jobs on Northflank H100 with `replicalab-train`.
5. Save separate Scientist and Lab Manager adapters, plus:
   - reward curves
   - component curves
   - paper-understanding and communication metrics
   - before/after evaluation metrics
   - cumulative benchmark history plots across runs
   - replay and plot artifacts
### Training Loop

```text
reset -> Scientist acts -> Lab Manager responds -> ... -> episode ends -> deterministic reward -> policy update
```
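The Scientist adapter is trained with GRPO, which turns the scalar episode reward into an advantage by comparing each rollout with the others in its group. Here is a minimal sketch of that standard group-relative normalization, assuming a group is several Scientist rollouts scored by the deterministic judge. It is not the code in `replicalab/training/scientist_grpo.py`; it uses the population standard deviation, and implementations differ on that choice.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: each rollout's reward relative to its own group.

    advantage_i = (r_i - mean(group)) / (std(group) + eps)
    """
    if not rewards:
        raise ValueError("need at least one rollout")
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group: three rollouts of the same scenario, scored by the deterministic judge.
print(group_relative_advantages([4.0, 6.0, 8.0]))  # roughly [-1.22, 0.0, 1.22]
print(group_relative_advantages([5.0, 5.0, 5.0]))  # [0.0, 0.0, 0.0]: no signal when all tie
```

Because the judge is deterministic, any spread within a group comes from the Scientist's own choices rather than from scoring noise, which keeps these advantages meaningful.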
### Target Behaviors Over Training

- Ask better questions before committing to a plan
- Understand the paper brief before proposing a protocol
- Preserve critical checks, assumptions, and required steps
- Choose realistic substitutions when preferred resources are unavailable
- Reach agreement in fewer rounds
- Avoid impossible or over-budget plans

---
## Scenario System

Scenarios are generated deterministically from a seed. Each template emits a normalized scenario pack with:

- `task_summary`
- `success_criteria`
- `constraints`
- `resources`
- `allowed_substitutions`
- `hidden_reference_spec`

Difficulty scaling should mechanically tighten constraints, remove resources, or add conflicts instead of changing the outer contract or prompt structure.

| Difficulty | Description |
|------------|-------------|
| **Easy** | Most required resources are present and tradeoffs are light |
| **Medium** | Some missing items, tighter budgets or time, and at least one meaningful conflict |
| **Hard** | Multiple shortages, sharper tradeoffs, and serious scheduling or resource conflicts |
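The sketch below illustrates "mechanical" difficulty scaling under stated assumptions: a simplified hypothetical `Pack` with a budget, a duration, resources, and conflicts, plus made-up per-level knobs (`SCALE`, `REMOVED`, `CONFLICTS`, `CONFLICT_POOL`). The shipped templates may scale different fields by different amounts. Two properties carry over: every level returns the same `Pack` type, and a local `random.Random(seed)` makes the result reproducible.

```python
import random
from dataclasses import dataclass, replace

# Hypothetical knobs; the real templates may scale different fields by different amounts.
SCALE = {"easy": 1.0, "medium": 0.8, "hard": 0.6}  # budget and time multiplier
REMOVED = {"easy": 0, "medium": 1, "hard": 2}      # resources taken away
CONFLICTS = {"easy": 0, "medium": 1, "hard": 2}    # conflicts injected
CONFLICT_POOL = ("shared GPU queue", "reviewer unavailable", "data access delay")


@dataclass(frozen=True)
class Pack:
    budget: float
    days: int
    resources: tuple[str, ...]
    conflicts: tuple[str, ...] = ()


def apply_difficulty(base: Pack, difficulty: str, seed: int) -> Pack:
    """Tighten the same Pack type; the outer contract never changes."""
    rng = random.Random(seed)  # local RNG: same seed and difficulty give the same pack
    removed = set(rng.sample(base.resources, k=REMOVED[difficulty]))
    added = tuple(rng.sample(CONFLICT_POOL, k=CONFLICTS[difficulty]))
    return replace(
        base,
        budget=round(base.budget * SCALE[difficulty], 2),
        days=max(1, round(base.days * SCALE[difficulty])),
        resources=tuple(r for r in base.resources if r not in removed),
        conflicts=base.conflicts + added,
    )


base = Pack(budget=10.0, days=5, resources=("gpu_a", "gpu_b", "dataset_mirror", "reviewer"))
for level in ("easy", "medium", "hard"):
    print(level, apply_difficulty(base, level, seed=42))
```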
### Included Scenario Templates

| Template | Domain | Example Task |
|----------|--------|--------------|
| `math_reasoning` | Mathematics | Proof planning under tool, review, and time constraints |
| `ml_benchmark` | Machine learning | Model evaluation with dataset, compute, and time constraints |
| `finance_trading` | Finance and trading | Offline strategy and backtest planning under risk and capital limits |

### Scenario Summaries

**Mathematics Reasoning** -- The Scientist must plan a structured proof of a mathematical theorem (e.g., the Cauchy-Schwarz inequality) under tight deadline and review constraints. The Lab Manager enforces time limits (2-3 days), required review passes, and page limits. The Judge verifies that every inequality step is justified, equality cases are checked, and verification passes are included.

**ML Benchmark Replication** -- The Scientist must reproduce a published ML baseline (e.g., TinyBERT on AG News or ResNet-18 on CIFAR-10) within a tolerance margin. The Lab Manager controls the GPU budget (8-10 GPU-hours), cluster scheduling, and dataset access rules. Tradeoffs include seed count vs. budget and GPU tier vs. fidelity to the original compute setup. The Judge verifies that held-out accuracy falls within 1 point of the target and that no critical evaluation steps were skipped.

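The two checks in the ML scenario are simple arithmetic, sketched below with hypothetical helpers (`affordable_seeds`, `within_tolerance`). The per-run costs and the 93.0% target are made-up illustrative numbers, not values from the shipped template; only the 10 GPU-hour budget and the 1-point tolerance come from the summary above.

```python
import math


def affordable_seeds(gpu_budget_hours: float, hours_per_run: float) -> int:
    """How many full training runs (one per seed) fit in the GPU budget."""
    return math.floor(gpu_budget_hours / hours_per_run)


def within_tolerance(accuracy_pct: float, target_pct: float, tolerance_pts: float = 1.0) -> bool:
    """Judge-style check: held-out accuracy within +/- tolerance points of the target."""
    return abs(accuracy_pct - target_pct) <= tolerance_pts


# Hypothetical run costs on two GPU tiers, against a 10 GPU-hour budget.
print(affordable_seeds(10, hours_per_run=3.0))  # 3 seeds
print(affordable_seeds(10, hours_per_run=2.0))  # 5 seeds
# Hypothetical target of 93.0% held-out accuracy.
print(within_tolerance(92.4, target_pct=93.0))  # True
print(within_tolerance(91.7, target_pct=93.0))  # False
```

This is the seed-count vs. budget tradeoff in miniature: a cheaper per-run cost buys more seeds, which tightens the estimate of whether the reproduction truly lands within tolerance.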

**Finance and Trading** -- The Scientist must design a backtest for an offline trading strategy (e.g., mean reversion on equities or momentum on futures). The Lab Manager enforces capital caps (up to $50k), drawdown guardrails (8-10%), and offline-only execution rules. The Judge scores risk-adjusted returns (Sharpe ratio), adherence to the drawdown guardrail, and the hygiene of evaluation splits.

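For reference, the two finance metrics named above can be computed as follows. This is a sketch using common conventions (sample standard deviation, annualization by the square root of 252 trading days, drawdown measured from the running peak), not necessarily the exact definitions in `replicalab/scoring/`; the equity curve is hypothetical.

```python
import math
from statistics import fmean, stdev


def sharpe_ratio(period_returns: list[float], risk_free_per_period: float = 0.0, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio: mean excess return over its sample std, scaled by sqrt(periods)."""
    excess = [r - risk_free_per_period for r in period_returns]
    volatility = stdev(excess)  # needs at least two returns
    if volatility == 0:
        raise ValueError("Sharpe ratio is undefined for zero volatility")
    return fmean(excess) / volatility * math.sqrt(periods_per_year)


def max_drawdown(equity_curve: list[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


# Hypothetical daily equity for a $50k backtest.
equity = [50_000, 52_000, 48_360, 49_000, 53_000]
returns = [b / a - 1 for a, b in zip(equity, equity[1:])]
dd = max_drawdown(equity)
print(f"max drawdown {dd:.1%}, within an 8% guardrail: {dd <= 0.08}")  # 7.0%, True
print(f"annualized Sharpe {sharpe_ratio(returns):.2f}")
```

Drawdown is measured from the running peak ($52k), not the starting capital, so the dip to $48,360 counts as a 7% drawdown even though equity never falls below $50k by more than 3.3%.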

---
## Project Structure

```text
replicalab-ai/
├── README.md
├── ReplicaLab_Architecture_Final.svg
├── pyproject.toml
├── openenv.yaml
├── replicalab/
│   ├── __init__.py
│   ├── models.py              # Action, Observation, State schemas
│   ├── client.py              # OpenEnv client wrapper
│   ├── oracle.py              # Optional frontier-model Oracle wrapper
│   ├── oracle_models.py       # Oracle scenario and post-mortem schemas
│   ├── cache.py               # Cached Oracle scenario generation
│   ├── prompts/
│   │   ├── scientist.txt
│   │   ├── lab_manager.txt
│   │   ├── judge.txt
│   │   ├── oracle_world_architect.txt
│   │   ├── oracle_adjudicator.txt
│   │   ├── oracle_event_injector.txt
│   │   ├── oracle_post_mortem.txt
│   │   └── oracle_lab_manager.txt
│   ├── scenarios/
│   │   ├── templates.py       # Normalized scenario pack + Oracle adapter
│   │   ├── math_reasoning.py
│   │   ├── ml_benchmark.py
│   │   └── finance_trading.py
│   ├── scoring/
│   │   ├── rubric.py          # Canonical deterministic reward math
│   │   ├── rigor.py
│   │   ├── feasibility.py
│   │   ├── fidelity.py
│   │   └── explain.py
│   ├── agents/
│   │   ├── scientist_policy.py
│   │   ├── lab_manager_policy.py
│   │   ├── lab_manager_agent.py  # Optional model-backed Lab Manager wrapper
│   │   └── judge_policy.py
│   ├── env/
│   │   └── replicalab_env.py  # Real env with optional Oracle hooks
│   ├── training/
│   │   ├── artifacts.py
│   │   ├── cli.py
│   │   ├── corpus.py
│   │   ├── datasets.py
│   │   ├── evaluation.py
│   │   ├── lab_manager_sft.py
│   │   ├── metrics.py
│   │   ├── plots.py
│   │   ├── rollout.py
│   │   ├── runtime.py
│   │   └── scientist_grpo.py
│   └── utils/
│       ├── seed.py
│       ├── validation.py
│       └── logging.py
├── server/
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── package.json
│   ├── vite.config.ts
│   ├── index.html
│   └── src/
│       ├── App.tsx            # Routes, Toast provider, Onboarding
│       ├── pages/             # DashboardPage, EpisodePage, ComparePage
│       ├── components/        # UI panels, 3D scenes, editor, toasts
│       ├── lib/               # api.ts, audio.ts, confetti.ts, useTheme.ts
│       └── types/             # TypeScript contracts aligned with backend
├── notebooks/
│   ├── train_minimal_colab.ipynb
│   └── train_colab.ipynb
└── tests/
    ├── test_env.py
    ├── test_reward.py
    ├── test_scenarios.py
    ├── test_oracle.py
    ├── test_cache.py
    └── test_server.py
```

---
## Deployment

**Live deployment:** [`https://ayushozha-replicalab.hf.space`](https://ayushozha-replicalab.hf.space)

The app is deployed on HF Spaces with `sdk: docker` on port `7860`. The multi-stage Dockerfile builds the React frontend with Node.js, then serves both the UI and API from a single Python container.

```bash
curl https://ayushozha-replicalab.hf.space/health
# -> {"status":"ok","env":"real","version":"0.1.0"}
```

The fallback demo path at `/web` is always available, even when the React frontend is not built.

---
## Toolchain

| Tool | Purpose |
|------|---------|
| **OpenEnv 0.2.1** | Environment class and server |
| **FastAPI + WebSocket** | Live environment serving |
| **TRL / Unsloth** | RL training (GRPO) |
| **React + Vite** | Frontend |
| **Tailwind + shadcn/ui** | Styling |
| **Docker** | Packaging |
| **Hugging Face Spaces** | Public hosting |
| **Notebook / Colab / Northflank H100** | Training and evaluation |

---
## Results

### What Improved After Training

- **Higher reward**: The trained Scientist achieves 67% higher average reward (4.25 -> 7.10) by learning to preserve rigor while respecting constraints.
- **Faster agreement**: Negotiations converge in 2.8 rounds on average vs. 4.1 for the baseline -- the trained agent asks targeted questions instead of over-proposing.
- **Fewer invalid actions**: The invalid action rate drops from 15% to 4% as the agent learns the structured action schema.

### Evaluation Summary

| Metric | Baseline Scientist | Trained Scientist | Relative change |
|--------|-------------------:|------------------:|----------------:|
| Average reward | 4.25 | 7.10 | +67% |
| Rounds to agreement | 4.1 | 2.8 | -32% |
| Invalid action rate | 15% | 4% | -73% |
| Agreement rate | 50% | 80% | +60% |
| Avg rigor score | 0.55 | 0.72 | +31% |
| Avg feasibility score | 0.52 | 0.78 | +50% |
| Avg fidelity score | 0.58 | 0.71 | +22% |
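The change column is a relative change, not a difference in percentage points: for the agreement rate, +60% corresponds to a 30-point absolute gain (50% to 80%). The short check below recomputes the column from the table's own numbers; `relative_change_pct` is just an illustrative helper.

```python
def relative_change_pct(baseline: float, trained: float) -> float:
    """Relative change in percent: (trained - baseline) / baseline * 100."""
    return (trained - baseline) / baseline * 100


# (baseline, trained) pairs from the evaluation table; rates are in percent.
rows = {
    "Average reward": (4.25, 7.10),
    "Rounds to agreement": (4.1, 2.8),
    "Invalid action rate": (15, 4),
    "Agreement rate": (50, 80),
    "Avg rigor score": (0.55, 0.72),
    "Avg feasibility score": (0.52, 0.78),
    "Avg fidelity score": (0.58, 0.71),
}
for metric, (before, after) in rows.items():
    print(f"{metric}: {relative_change_pct(before, after):+.0f}%")
```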
### Key Takeaways for Judges

1. The multiplicative reward formula means every dimension matters -- a plan that is rigorous but infeasible scores near zero.
2. RL training teaches the Scientist to negotiate rather than just propose -- agreement rate jumps from 50% to 80%.
3. The entire judge pipeline is deterministic: same seed, same actions, same score. No LLM-as-judge variance.

---
## Hackathon Track Alignment

| Track | Fit |
|-------|-----|
| **Multi-Agent Interactions** | Two roles with private information negotiate toward consensus |
| **World Modeling (Professional)** | Agent reasons inside a professional world with hidden constraints |
| **Long-Horizon Planning** | Multi-round ask-revise-recover-converge cycle |
| **Self-Improvement** | Scientist measurably improves over repeated episodes |

---
## License

MIT