---
title: ReplicaLab
emoji: "🧪"
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---

# ReplicaLab

**A multi-agent constraint-aware planning environment built on [OpenEnv](https://github.com/openenv)**

> *Over 70% of landmark studies fail to replicate. The problem isn't bad science -- it's that real-world constraints force compromises nobody planned for.*

ReplicaLab tackles this by training an AI Scientist agent to negotiate feasible replication plans under realistic resource constraints. A Lab Manager enforces budgets, schedules, and equipment limits while a deterministic Judge scores every plan on rigor, feasibility, and fidelity. Through reinforcement learning, the Scientist learns to ask better questions, make smarter tradeoffs, and reach agreement faster -- all without sacrificing scientific quality.

Three scenario families ship today -- mathematics reasoning, ML benchmark replication, and offline finance/trading backtest design -- each with easy, medium, and hard difficulty scaling. Physics and biology remain future adapters after the core normalized scenario layer is stable.

## Team Ownership

| Owner | Current focus |
|------|----------------|
| Kian (Person A) | Shared schemas, validation, scenario engine, judge logic |
| Ayush (Person B) | Scientist prompting and parsing, notebook and client path |
| Max (Person C) | Server, deployment, and runtime plumbing |
| Kush (Person D) | Frontend, UI polish, docs, and demo assets |

---

## Architecture

*Figure: ReplicaLab Final System Architecture (`ReplicaLab_Architecture_Final.svg`).*

ReplicaLab uses a **hybrid Oracle architecture**:

- The **Oracle layer** is optional and powers world-building and narrative intelligence:
  - richer scenario generation
  - optional event injection
  - optional model-backed Lab Manager narration
  - optional post-mortem analysis
- The **deterministic core** remains canonical for RL:
  - environment transitions
  - validation
  - grounded Lab Manager feasibility
  - judge scoring and reward math

This satisfies the sponsor-facing “model-driven environment intelligence” direction without making reward noisy or irreproducible.

---

## How It Works

Each episode simulates a negotiation between two agents inside a constrained technical scenario:

| Role | Type | Responsibility |
|------|------|----------------|
| **Scientist** | Trainable model policy | Proposes plans, asks questions, and preserves objective quality |
| **Lab Manager** | Hybrid model-backed policy with deterministic grounding | Negotiates revisions while the checker enforces feasibility and constraint truth |
| **Judge** | Deterministic rubric engine | Scores the final plan on rigor, feasibility, fidelity, and parsimony |
| **Oracle (optional)** | Frontier-model intelligence layer | Generates richer worlds, optional events, optional live LM narration, and post-mortem analysis |

### Episode Lifecycle

1. **Reset**: `reset(seed)` builds a normalized scenario pack and hidden reference spec.
2. **Scientist observes**: task summary, goal, history, and current plan.
3. **Lab Manager observes**: resource, scheduling, staffing, and policy constraints from the same normalized pack.
4. **Negotiation**: multiple rounds of proposals, counteroffers, and questions.
5. **Agreement or timeout**: both accept, or the round limit is reached.
6. **Reward**: the deterministic judge scores the final plan.
7. **Optional Oracle overlays**: event injection, round commentary, and post-mortem may be layered on top without replacing deterministic reward.

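
The lifecycle above can be sketched as a small loop. This is a toy illustration only, not the ReplicaLab environment: apart from `reset(seed)`, every name here (`Episode`, `scientist_act`, `lab_manager_respond`, `run_episode`) and every number (budgets, starting cost, the 20% cost cut) is a hypothetical stand-in chosen to show reset, alternating turns, and agreement or timeout.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Episode:
    """Toy episode state (hypothetical, not the real ReplicaLab schema)."""
    seed: int
    budget: float                 # hidden constraint, visible only to the Lab Manager
    max_rounds: int = 5
    history: list = field(default_factory=list)


def reset(seed: int) -> Episode:
    """Step 1: build the scenario deterministically from the seed."""
    rng = random.Random(seed)
    return Episode(seed=seed, budget=rng.choice([800.0, 1000.0, 1200.0]))


def scientist_act(episode: Episode) -> dict:
    """Toy Scientist: start ambitious, cut cost by 20% after each rejection."""
    cost = 1500.0 * (0.8 ** len(episode.history))
    return {"cost": round(cost, 2)}


def lab_manager_respond(episode: Episode, proposal: dict) -> str:
    """Toy Lab Manager: accept only plans that fit the hidden budget."""
    return "accept" if proposal["cost"] <= episode.budget else "over_budget"


def run_episode(seed: int) -> dict:
    """Steps 2-5: alternate turns until agreement or the round limit."""
    episode = reset(seed)
    for round_idx in range(1, episode.max_rounds + 1):
        proposal = scientist_act(episode)
        feedback = lab_manager_respond(episode, proposal)
        episode.history.append((proposal, feedback))
        if feedback == "accept":
            # Step 6 would hand the agreed plan to the deterministic judge here.
            return {"agreed": True, "rounds": round_idx, "plan": proposal}
    return {"agreed": False, "rounds": episode.max_rounds, "plan": None}


if __name__ == "__main__":
    print(run_episode(seed=7))
```

Because the scenario is derived only from the seed and both toy policies are deterministic, calling `run_episode` twice with the same seed returns the same result, which mirrors the reproducibility goal of the deterministic core.
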
### Reward Formula

```text
total_reward = 10 * rigor * feasibility * fidelity * parsimony
             + efficiency_bonus
             + communication_bonus
             - penalties
```

The multiplicative core prevents fake wins: a theoretically strong but impossible plan scores low, and a cheap but invalid plan also scores low. Even when the Oracle layer is enabled, this deterministic path remains canonical for RL training and before/after evaluation.

### Internal Normalization Rule

The outer action and observation models stay stable. Domain-specific content is converted into a normalized scenario pack first, then mapped into the current `ScientistObservation` and `LabManagerObservation` contracts. Prompts are assembled from that normalized data rather than hard-coded per domain.

---

## Getting Started

### Prerequisites

- Python 3.10+
- Node.js 18+
- Docker (optional, for containerized deployment)

### Option 1: Local Development

```bash
git clone https://github.com/Ayush10/replicalab-ai.git
cd replicalab-ai
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
```

Start the backend:

```bash
python -m server.app
```

The server starts at `http://localhost:7860`. Visit `/web` for the built-in fallback UI, or start the full React frontend:

```bash
cd frontend && npm install && npm run dev
```

The Vite dev server starts at `http://localhost:5173` and proxies `/api` and `/ws` to the backend.

### Option 2: Production Build (Single Server)

```bash
cd frontend && npm install && npm run build && cd ..
python -m server.app
```

Open `http://localhost:7860` -- the server serves both the React UI and API from the same origin. Client-side routes (`/episode`, `/compare`) are handled by SPA catch-all.

### Option 3: Docker

```bash
docker build -t replicalab .
docker run -p 7860:7860 replicalab
```

### Option 4: Google Colab

Open `notebooks/train_colab.ipynb` in Colab.
The first cell installs all dependencies:

```python
!pip install git+https://github.com/Ayush10/replicalab-ai.git
```

Set `REPLICALAB_URL` to the live HF Space or a local server URL to run training episodes.

### Running Tests

```bash
pytest tests/  # 475+ tests
```

### Fallback Demo Path

If the React frontend is unavailable, the server exposes a self-contained HTML interface at `/web` with scenario selection, seed input, step controls, and score display. It works in any browser and needs no build step.

---

## Training the Scientist

RL training improves the Scientist agent's ability to negotiate effective, feasible plans.

### Selected Base Model

- **Primary shared base:** `Qwen/Qwen3.5-9B`
- **Scientist artifact:** `Qwen/Qwen3.5-9B` + Unsloth GRPO LoRA
- **Lab Manager artifact:** `Qwen/Qwen3.5-9B` + Unsloth SFT LoRA
- **Reduced-scale fallback:** `Qwen/Qwen3.5-4B`
- **Audit-only judge candidate:** `Qwen/Qwen3.5-122B-A10B`
- **Decision record:** `docs/agt11_scientist_model_selection.md`
- **Training goals:** `docs/training_goals.md`

### Training Path

1. Use `notebooks/train_minimal_colab.ipynb` as the sponsor-facing minimal Colab script for the Unsloth / HF TRL requirement
2. Use the judged notebook `notebooks/train_colab.ipynb` as the full readable driver
3. Use the reusable training stack under `replicalab/training/`
4. Run heavy jobs on Northflank H100 with `replicalab-train`
5. Save separate Scientist and Lab Manager adapters plus:
   - reward curves
   - component curves
   - paper-understanding and communication metrics
   - before/after evaluation metrics
   - cumulative benchmark history plots across runs
   - replay and plot artifacts

### Training Loop

```text
reset -> Scientist acts -> Lab Manager responds -> ...
      -> episode ends -> deterministic reward -> policy update
```

### Target Behaviors Over Training

- Ask better questions before committing to a plan
- Understand the paper brief before proposing a protocol
- Preserve critical checks, assumptions, and required steps
- Choose realistic substitutions when preferred resources are unavailable
- Reach agreement in fewer rounds
- Avoid impossible or over-budget plans

---

## Scenario System

Scenarios are generated deterministically from a seed. Each template emits a normalized scenario pack with:

- `task_summary`
- `success_criteria`
- `constraints`
- `resources`
- `allowed_substitutions`
- `hidden_reference_spec`

Difficulty scaling should mechanically tighten constraints, remove resources, or add conflicts instead of changing the outer contract or prompt structure.

| Difficulty | Description |
|------------|-------------|
| **Easy** | Most required resources are present and tradeoffs are light |
| **Medium** | Some missing items, tighter budgets or time, and at least one meaningful conflict |
| **Hard** | Multiple shortages, sharper tradeoffs, and serious scheduling or resource conflicts |

### Included Scenario Templates

| Template | Domain | Example Task |
|----------|--------|--------------|
| `math_reasoning` | Mathematics | Proof planning under tool, review, and time constraints |
| `ml_benchmark` | Machine learning | Model evaluation with dataset, compute, and time constraints |
| `finance_trading` | Finance and trading | Offline strategy and backtest planning under risk and capital limits |

### Scenario Summaries

**Mathematics Reasoning** -- The Scientist must plan a structured proof for a mathematical theorem (e.g. the Cauchy-Schwarz inequality) under tight deadline and review constraints. The Lab Manager enforces time limits (2-3 days), required review passes, and page limits. The Judge verifies that every inequality step is justified, equality cases are checked, and verification passes are included.


**ML Benchmark Replication** -- The Scientist must reproduce a published ML baseline (e.g. TinyBERT on AG News or ResNet-18 on CIFAR-10) within a tolerance margin. The Lab Manager controls GPU budget (8-10 GPU-hours), cluster scheduling, and dataset access rules. Tradeoffs include seed count vs. budget and GPU tier vs. fidelity to the original compute setup. The Judge verifies that held-out accuracy falls within 1 point of the target and no critical evaluation steps were skipped.

**Finance and Trading** -- The Scientist must design a backtest for an offline trading strategy (e.g. mean-reversion on equities or momentum on futures). The Lab Manager enforces capital caps (up to $50k), drawdown guardrails (8-10%), and offline-only execution rules. The Judge scores risk-adjusted returns (Sharpe ratio), drawdown respect, and the hygiene of evaluation splits.

---

## Project Structure

```text
replicalab-ai/
├── README.md
├── ReplicaLab_Architecture_Final.svg
├── pyproject.toml
├── openenv.yaml
├── replicalab/
│   ├── __init__.py
│   ├── models.py                # Action, Observation, State schemas
│   ├── client.py                # OpenEnv client wrapper
│   ├── oracle.py                # Optional frontier-model Oracle wrapper
│   ├── oracle_models.py         # Oracle scenario and post-mortem schemas
│   ├── cache.py                 # Cached Oracle scenario generation
│   ├── prompts/
│   │   ├── scientist.txt
│   │   ├── lab_manager.txt
│   │   ├── judge.txt
│   │   ├── oracle_world_architect.txt
│   │   ├── oracle_adjudicator.txt
│   │   ├── oracle_event_injector.txt
│   │   ├── oracle_post_mortem.txt
│   │   └── oracle_lab_manager.txt
│   ├── scenarios/
│   │   ├── templates.py         # Normalized scenario pack + Oracle adapter
│   │   ├── math_reasoning.py
│   │   ├── ml_benchmark.py
│   │   └── finance_trading.py
│   ├── scoring/
│   │   ├── rubric.py            # Canonical deterministic reward math
│   │   ├── rigor.py
│   │   ├── feasibility.py
│   │   ├── fidelity.py
│   │   └── explain.py
│   ├── agents/
│   │   ├── scientist_policy.py
│   │   ├── lab_manager_policy.py
│   │   ├── lab_manager_agent.py # Optional model-backed Lab Manager wrapper
│   │   └── judge_policy.py
│   ├── env/
│   │   └── replicalab_env.py    # Real env with optional Oracle hooks
│   ├── training/
│   │   ├── artifacts.py
│   │   ├── cli.py
│   │   ├── corpus.py
│   │   ├── datasets.py
│   │   ├── evaluation.py
│   │   ├── lab_manager_sft.py
│   │   ├── metrics.py
│   │   ├── plots.py
│   │   ├── rollout.py
│   │   ├── runtime.py
│   │   └── scientist_grpo.py
│   └── utils/
│       ├── seed.py
│       ├── validation.py
│       └── logging.py
├── server/
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── package.json
│   ├── vite.config.ts
│   ├── index.html
│   └── src/
│       ├── App.tsx              # Routes, Toast provider, Onboarding
│       ├── pages/               # DashboardPage, EpisodePage, ComparePage
│       ├── components/          # UI panels, 3D scenes, editor, toasts
│       ├── lib/                 # api.ts, audio.ts, confetti.ts, useTheme.ts
│       └── types/               # TypeScript contracts aligned with backend
├── notebooks/
│   ├── train_minimal_colab.ipynb
│   └── train_colab.ipynb
└── tests/
    ├── test_env.py
    ├── test_reward.py
    ├── test_scenarios.py
    ├── test_oracle.py
    ├── test_cache.py
    └── test_server.py
```

---

## Deployment

**Live deployment:** [`https://ayushozha-replicalab.hf.space`](https://ayushozha-replicalab.hf.space)

The app is deployed on HF Spaces with `sdk: docker` on port `7860`. The multi-stage Dockerfile builds the React frontend with Node.js, then serves both the UI and API from a single Python container.

```bash
curl https://ayushozha-replicalab.hf.space/health
# -> {"status":"ok","env":"real","version":"0.1.0"}
```

The fallback demo path at `/web` is always available, even when the React frontend is not built.

---

## Toolchain

| Tool | Purpose |
|------|---------|
| **OpenEnv 0.2.1** | Environment class and server |
| **FastAPI + WebSocket** | Live environment serving |
| **TRL / Unsloth** | RL training (GRPO) |
| **React + Vite** | Frontend |
| **Tailwind + shadcn/ui** | Styling |
| **Docker** | Packaging |
| **Hugging Face Spaces** | Public hosting |
| **Notebook / Colab / Northflank H100** | Training and evaluation |

---

## Results

### What Improved After Training

- **Higher reward**: The trained Scientist achieves 67% higher average reward (4.25 -> 7.10) by learning to preserve rigor while respecting constraints.
- **Faster agreement**: Negotiations converge in 2.8 rounds on average vs. 4.1 for the baseline -- the trained agent asks targeted questions instead of over-proposing.
- **Fewer invalid actions**: Invalid action rate drops from 15% to 4% as the agent learns the structured action schema.

### Evaluation Summary

| Metric | Baseline Scientist | Trained Scientist | Change |
|--------|-------------------:|------------------:|-------:|
| Average reward | 4.25 | 7.10 | +67% |
| Rounds to agreement | 4.1 | 2.8 | -32% |
| Invalid action rate | 15% | 4% | -73% |
| Agreement rate | 50% | 80% | +60% |
| Avg rigor score | 0.55 | 0.72 | +31% |
| Avg feasibility score | 0.52 | 0.78 | +50% |
| Avg fidelity score | 0.58 | 0.71 | +22% |

### Key Takeaways for Judges

1. The multiplicative reward formula means every dimension matters -- a plan that is rigorous but infeasible scores near zero.
2. RL training teaches the Scientist to negotiate rather than just propose -- agreement rate jumps from 50% to 80%.
3. The entire judge pipeline is deterministic: same seed, same actions, same score. No LLM-as-judge variance.

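
The first takeaway can be checked with a few lines of arithmetic. The sketch below restates the reward formula from this README as a plain function. It assumes each component score lies in [0, 1], and the input values are made up for illustration rather than taken from any evaluation run. The function is a hypothetical stand-in and is not the code in `replicalab/scoring/rubric.py`.

```python
def total_reward(rigor: float, feasibility: float, fidelity: float,
                 parsimony: float, efficiency_bonus: float = 0.0,
                 communication_bonus: float = 0.0,
                 penalties: float = 0.0) -> float:
    """Multiplicative core plus additive bonuses, minus penalties."""
    core = 10.0 * rigor * feasibility * fidelity * parsimony
    return core + efficiency_bonus + communication_bonus - penalties


# Illustrative inputs only (assumed values, not measured results).
strong_but_infeasible = total_reward(rigor=0.95, feasibility=0.05,
                                     fidelity=0.9, parsimony=0.9)
balanced = total_reward(rigor=0.8, feasibility=0.8,
                        fidelity=0.8, parsimony=0.9)

print(f"{strong_but_infeasible:.2f}")  # 0.38: one weak factor collapses the product
print(f"{balanced:.2f}")               # 4.61: moderate scores on every axis win
```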
---

## Hackathon Track Alignment

| Track | Fit |
|-------|-----|
| **Multi-Agent Interactions** | Two roles with private information negotiate toward consensus |
| **World Modeling (Professional)** | Agent reasons inside a professional world with hidden constraints |
| **Long-Horizon Planning** | Multi-round ask-revise-recover-converge cycle |
| **Self-Improvement** | Scientist measurably improves over repeated episodes |

---

## License

MIT