title: ReplicaLab
emoji: 🧪
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false

ReplicaLab

A multi-agent constraint-aware planning environment built on OpenEnv

By some widely cited estimates, over 70% of landmark studies fail to replicate. The problem is often not bad science: real-world constraints force compromises nobody planned for.

ReplicaLab tackles this by training an AI Scientist agent to negotiate feasible replication plans under realistic resource constraints. A Lab Manager enforces budgets, schedules, and equipment limits while a deterministic Judge scores every plan on rigor, feasibility, fidelity, and parsimony. Through reinforcement learning, the Scientist learns to ask better questions, make smarter tradeoffs, and reach agreement faster -- all without sacrificing scientific quality.

Three scenario families ship today -- mathematics reasoning, ML benchmark replication, and offline finance/trading backtest design -- each with easy, medium, and hard difficulty scaling. Physics and biology remain future adapters after the core normalized scenario layer is stable.

Team Ownership

| Owner | Current focus |
| --- | --- |
| Kian (Person A) | Shared schemas, validation, scenario engine, judge logic |
| Ayush (Person B) | Scientist prompting and parsing, notebook and client path |
| Max (Person C) | Server, deployment, and runtime plumbing |
| Kush (Person D) | Frontend, UI polish, docs, and demo assets |

Architecture

Figure: ReplicaLab Final System Architecture (ReplicaLab_Architecture_Final.svg)

ReplicaLab uses a hybrid Oracle architecture:

  • The Oracle layer is optional and powers world-building and narrative intelligence:
    • richer scenario generation
    • optional event injection
    • optional model-backed Lab Manager narration
    • optional post-mortem analysis
  • The deterministic core remains canonical for RL:
    • environment transitions
    • validation
    • grounded Lab Manager feasibility
    • judge scoring and reward math

This satisfies the sponsor-facing "model-driven environment intelligence" direction without making reward noisy or irreproducible.
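
One way to picture this split, as a minimal sketch: the reward comes only from a deterministic function, and an optional Oracle callback may attach narration but has no path to the reward. `StepResult`, `deterministic_reward`, and `step` are hypothetical names used for illustration; they are not the ReplicaLab API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepResult:
    reward: float        # always produced by the deterministic core
    narration: str = ""  # optional Oracle output; never feeds back into reward


def deterministic_reward(plan_cost: float, budget: float) -> float:
    """Toy stand-in for the canonical judge: 1.0 if within budget, else 0.0."""
    return 1.0 if plan_cost <= budget else 0.0


def step(
    plan_cost: float,
    budget: float,
    oracle: Optional[Callable[[float, float], str]] = None,
) -> StepResult:
    # The reward is computed before the Oracle is consulted.
    reward = deterministic_reward(plan_cost, budget)
    # The Oracle (hypothetical callback) only decorates the result.
    narration = oracle(plan_cost, budget) if oracle is not None else ""
    return StepResult(reward=reward, narration=narration)
```

Calling `step` with or without an Oracle callback yields the same reward, which is the property the deterministic core is meant to guarantee.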


How It Works

Each episode simulates a negotiation between two agents inside a constrained technical scenario:

| Role | Type | Responsibility |
| --- | --- | --- |
| Scientist | Trainable model policy | Proposes plans, asks questions, and preserves objective quality |
| Lab Manager | Hybrid model-backed policy with deterministic grounding | Negotiates revisions while the checker enforces feasibility and constraint truth |
| Judge | Deterministic rubric engine | Scores the final plan on rigor, feasibility, fidelity, and parsimony |
| Oracle (optional) | Frontier-model intelligence layer | Generates richer worlds, optional events, optional live Lab Manager narration, and post-mortem analysis |

Episode Lifecycle

  1. Reset: reset(seed) builds a normalized scenario pack and hidden reference spec.
  2. Scientist observes: task summary, goal, history, and current plan.
  3. Lab Manager observes: resource, scheduling, staffing, and policy constraints from the same normalized pack.
  4. Negotiation: multiple rounds of proposals, counteroffers, and questions.
  5. Agreement or timeout: both accept, or the round limit is reached.
  6. Reward: the deterministic judge scores the final plan.
  7. Optional Oracle overlays: event injection, round commentary, and post-mortem may be layered on top without replacing deterministic reward.
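
The lifecycle above can be sketched as a toy loop. This is an illustration under stated assumptions, not the real environment: `ToyNegotiationEnv`, its fields, and the single-number "plan cost" action are all hypothetical simplifications of the actual action and observation models.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyNegotiationEnv:
    max_rounds: int = 4
    budget: float = 0.0
    round_index: int = 0
    history: list = field(default_factory=list)

    def reset(self, seed: int) -> dict:
        rng = random.Random(seed)  # seeded, so the scenario is reproducible
        self.budget = rng.choice([8, 9, 10])
        self.round_index = 0
        self.history = []
        # Scientist view: the budget is a Lab Manager constraint, not shown here.
        return {"task_summary": "toy replication", "history": []}

    def step(self, proposed_cost: float) -> tuple:
        self.round_index += 1
        accepted = proposed_cost <= self.budget  # Lab Manager feasibility check
        self.history.append((proposed_cost, accepted))
        done = accepted or self.round_index >= self.max_rounds
        # Reward is only assigned once the episode ends (agreement or timeout).
        reward = 1.0 if (done and accepted) else 0.0
        return {"history": list(self.history)}, reward, done


def run_episode(seed: int, proposals: list) -> tuple:
    env = ToyNegotiationEnv()
    env.reset(seed)
    reward = 0.0
    for cost in proposals:
        _, reward, done = env.step(cost)
        if done:
            break
    return reward, env.round_index
```

For example, `run_episode(0, [12, 11, 7])` ends in agreement on the third round, while four over-budget proposals in a row hit the round limit with zero reward.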

Reward Formula

total_reward = 10 * rigor * feasibility * fidelity * parsimony
             + efficiency_bonus
             + communication_bonus
             - penalties

The multiplicative core prevents fake wins: a theoretically strong but impossible plan scores low, and a cheap but invalid plan also scores low. Even when the Oracle layer is enabled, this deterministic path remains canonical for RL training and before/after evaluation.
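
A minimal sketch of the formula above, assuming each of the four component scores lies in [0, 1] (the evaluation table reports values such as 0.55 and 0.78, but the range is an assumption here). The function name and the validation step are illustrative; the canonical implementation lives in the scoring module.

```python
def total_reward(
    rigor: float,
    feasibility: float,
    fidelity: float,
    parsimony: float,
    efficiency_bonus: float = 0.0,
    communication_bonus: float = 0.0,
    penalties: float = 0.0,
) -> float:
    """Multiplicative core plus additive bonuses, minus penalties."""
    components = {
        "rigor": rigor,
        "feasibility": feasibility,
        "fidelity": fidelity,
        "parsimony": parsimony,
    }
    for name, value in components.items():
        # Assumed range; a component outside [0, 1] is treated as a caller error.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    core = 10.0 * rigor * feasibility * fidelity * parsimony
    return core + efficiency_bonus + communication_bonus - penalties


# A rigorous but nearly infeasible plan: the product collapses to about 0.43.
strong_but_impossible = total_reward(0.95, 0.05, 0.90, 1.0)

# A balanced plan with no bonuses or penalties: about 5.12.
balanced = total_reward(0.8, 0.8, 0.8, 1.0)
```

Because the core is a product, one weak component caps the whole score regardless of how strong the others are, which is why the balanced plan outscores the rigorous but infeasible one by a wide margin.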

Internal Normalization Rule

The outer action and observation models stay stable. Domain-specific content is converted into a normalized scenario pack first, then mapped into the current ScientistObservation and LabManagerObservation contracts. Prompts are assembled from that normalized data rather than hard-coded per domain.
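
As a rough sketch of this rule, a single normalized pack can feed both role views while the hidden reference spec stays out of each. The pack field names follow the list in the Scenario System section; the two view functions and their output keys are hypothetical and only approximate the real ScientistObservation and LabManagerObservation contracts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioPack:
    task_summary: str
    success_criteria: tuple
    constraints: dict
    resources: dict
    allowed_substitutions: dict
    hidden_reference_spec: dict  # used for scoring only, never shown to agents


def scientist_view(pack: ScenarioPack, history: list, current_plan: dict) -> dict:
    """Hypothetical mapping: task summary, goal, history, and current plan."""
    return {
        "task_summary": pack.task_summary,
        "goal": list(pack.success_criteria),
        "history": list(history),
        "current_plan": dict(current_plan),
    }


def lab_manager_view(pack: ScenarioPack) -> dict:
    """Hypothetical mapping: constraints and resources from the same pack."""
    return {
        "constraints": dict(pack.constraints),
        "resources": dict(pack.resources),
        "allowed_substitutions": dict(pack.allowed_substitutions),
    }


example_pack = ScenarioPack(
    task_summary="Reproduce a published ML baseline",
    success_criteria=("held-out accuracy within tolerance",),
    constraints={"gpu_hours": 8},
    resources={"gpu_tier": "mid"},
    allowed_substitutions={"gpu_tier": ["low"]},
    hidden_reference_spec={"required_seeds": 3},
)
```

A new domain would then only need to emit a `ScenarioPack`; neither view function changes.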


Getting Started

Prerequisites

  • Python 3.10+
  • Node.js 18+
  • Docker (optional, for containerized deployment)

Option 1: Local Development

git clone https://github.com/Ayush10/replicalab-ai.git
cd replicalab-ai

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -e ".[dev]"

Start the backend:

python -m server.app

The server starts at http://localhost:7860. Visit /web for the built-in fallback UI, or start the full React frontend:

cd frontend && npm install && npm run dev

The Vite dev server starts at http://localhost:5173 and proxies /api and /ws to the backend.

Option 2: Production Build (Single Server)

cd frontend && npm install && npm run build && cd ..
python -m server.app

Open http://localhost:7860 -- the server serves both the React UI and the API from the same origin. Client-side routes (/episode, /compare) are handled by an SPA catch-all route.

Option 3: Docker

docker build -t replicalab .
docker run -p 7860:7860 replicalab

Option 4: Google Colab

Open notebooks/train_colab.ipynb in Colab. The first cell installs all dependencies:

!pip install git+https://github.com/Ayush10/replicalab-ai.git

Set REPLICALAB_URL to the live HF Space or a local server URL to run training episodes.
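
A small sketch of how a notebook cell might resolve the server address, assuming REPLICALAB_URL is read as an environment variable (the notebook may instead use a plain Python variable). `resolve_base_url` is a hypothetical helper; the default matches the local server address given above.

```python
import os

DEFAULT_URL = "http://localhost:7860"  # local backend address from this README


def resolve_base_url(env=None) -> str:
    """Return the server base URL without a trailing slash."""
    env = os.environ if env is None else env
    # Fall back to the local server when the variable is unset or empty.
    url = (env.get("REPLICALAB_URL") or DEFAULT_URL).strip()
    return url.rstrip("/")
```

Passing a dictionary instead of the real environment keeps the helper easy to test.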

Running Tests

pytest tests/   # 475+ tests

Fallback Demo Path

If the React frontend is unavailable, the server exposes a self-contained HTML interface at /web with scenario selection, seed input, step controls, and score display. It works in any browser and requires no build step.


Training the Scientist

RL training improves the Scientist agent's ability to negotiate effective, feasible plans.

Selected Base Model

  • Primary shared base: Qwen/Qwen3.5-9B
  • Scientist artifact: Qwen/Qwen3.5-9B + Unsloth GRPO LoRA
  • Lab Manager artifact: Qwen/Qwen3.5-9B + Unsloth SFT LoRA
  • Reduced-scale fallback: Qwen/Qwen3.5-4B
  • Audit-only judge candidate: Qwen/Qwen3.5-122B-A10B
  • Decision record: docs/agt11_scientist_model_selection.md
  • Training goals: docs/training_goals.md

Training Path

  1. Use notebooks/train_minimal_colab.ipynb as the sponsor-facing minimal Colab script for the Unsloth / HF TRL requirement
  2. Use the judged notebook notebooks/train_colab.ipynb as the full readable driver
  3. Use the reusable training stack under replicalab/training/
  4. Run heavy jobs on Northflank H100 with replicalab-train
  5. Save separate Scientist and Lab Manager adapters plus:
    • reward curves
    • component curves
    • paper-understanding and communication metrics
    • before/after evaluation metrics
    • cumulative benchmark history plots across runs
    • replay and plot artifacts

Training Loop

reset -> Scientist acts -> Lab Manager responds -> ... -> episode ends -> deterministic reward -> policy update
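
The "policy update" step uses GRPO, which scores each rollout relative to the other rollouts sampled for the same prompt. The sketch below shows only that group-relative normalization in plain Python; it is not the Unsloth or TRL implementation, and details such as the epsilon value and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Center and scale episode rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same scenario, each scored by the deterministic judge.
group_rewards = [4.25, 7.10, 5.00, 6.00]
advantages = group_relative_advantages(group_rewards)
```

Rollouts that beat the group average get a positive advantage and are reinforced; those below it are discouraged. A group with identical rewards produces all-zero advantages and therefore no learning signal.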

Target Behaviors Over Training

  • Ask better questions before committing to a plan
  • Understand the paper brief before proposing a protocol
  • Preserve critical checks, assumptions, and required steps
  • Choose realistic substitutions when preferred resources are unavailable
  • Reach agreement in fewer rounds
  • Avoid impossible or over-budget plans

Scenario System

Scenarios are generated deterministically from a seed. Each template emits a normalized scenario pack with:

  • task_summary
  • success_criteria
  • constraints
  • resources
  • allowed_substitutions
  • hidden_reference_spec

Difficulty scaling should mechanically tighten constraints, remove resources, or add conflicts instead of changing the outer contract or prompt structure.

| Difficulty | Description |
| --- | --- |
| Easy | Most required resources are present and tradeoffs are light |
| Medium | Some missing items, tighter budgets or time, and at least one meaningful conflict |
| Hard | Multiple shortages, sharper tradeoffs, and serious scheduling or resource conflicts |
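
A minimal sketch of seed-deterministic generation with mechanical difficulty scaling. The scale factors, resource names, and `generate_constraints` itself are hypothetical; the point is only that the same seed always yields the same base scenario, and difficulty then tightens the budget and removes resources without changing the output shape.

```python
import random

# Hypothetical scaling knobs; the real templates may use different values.
BUDGET_SCALE = {"easy": 1.0, "medium": 0.8, "hard": 0.6}
RESOURCES_REMOVED = {"easy": 0, "medium": 1, "hard": 2}


def generate_constraints(seed: int, difficulty: str) -> dict:
    rng = random.Random(seed)  # local RNG, so no global state leaks in
    # Draws happen in the same order for every difficulty, so the base
    # scenario depends on the seed alone.
    base_gpu_hours = rng.choice([8, 9, 10])
    resources = ["gpu_high_tier", "gpu_low_tier", "dataset_mirror", "eval_harness"]
    rng.shuffle(resources)
    kept = resources[RESOURCES_REMOVED[difficulty]:]
    return {
        "gpu_hours": round(base_gpu_hours * BUDGET_SCALE[difficulty], 1),
        "resources": sorted(kept),
    }
```

With this layout, the hard variant of a seed always has a smaller budget and a subset of the resources available in the easy variant of the same seed.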

Included Scenario Templates

| Template | Domain | Example Task |
| --- | --- | --- |
| math_reasoning | Mathematics | Proof planning under tool, review, and time constraints |
| ml_benchmark | Machine learning | Model evaluation with dataset, compute, and time constraints |
| finance_trading | Finance and trading | Offline strategy and backtest planning under risk and capital limits |

Scenario Summaries

Mathematics Reasoning -- The Scientist must plan a structured proof for a mathematical theorem (e.g. Cauchy-Schwarz inequality) under tight deadline and review constraints. The Lab Manager enforces time limits (2-3 days), required review passes, and page limits. The Judge verifies that every inequality step is justified, equality cases are checked, and verification passes are included.

ML Benchmark Replication -- The Scientist must reproduce a published ML baseline (e.g. TinyBERT on AG News or ResNet-18 on CIFAR-10) within a tolerance margin. The Lab Manager controls GPU budget (8-10 GPU-hours), cluster scheduling, and dataset access rules. Tradeoffs include seed count vs. budget and GPU tier vs. fidelity to the original compute setup. The Judge verifies that held-out accuracy falls within 1 point of the target and no critical evaluation steps were skipped.

Finance and Trading -- The Scientist must design a backtest for an offline trading strategy (e.g. mean-reversion on equities or momentum on futures). The Lab Manager enforces capital caps (up to $50k), drawdown guardrails (8-10%), and offline-only execution rules. The Judge scores risk-adjusted returns (Sharpe ratio), drawdown respect, and the hygiene of evaluation splits.
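
The checks named in the ML and finance summaries can be illustrated with standard definitions. This is a sketch, not the Judge's code: it assumes a zero risk-free rate, daily returns annualized over 252 periods, and simple compounding, and all three function names are hypothetical.

```python
import math
from statistics import mean, stdev


def within_tolerance(measured: float, target: float, tolerance: float = 1.0) -> bool:
    """ML check: is held-out accuracy within `tolerance` points of the target?"""
    return abs(measured - target) <= tolerance


def sharpe_ratio(returns: list, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    sigma = stdev(returns)
    if sigma == 0:
        return 0.0  # no variation in returns: the ratio is left undefined as 0
    return mean(returns) / sigma * math.sqrt(periods_per_year)


def max_drawdown(returns: list) -> float:
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


# A +10% period followed by -20% gives a 20% drawdown, which breaches a 10% guardrail.
breaches_guardrail = max_drawdown([0.10, -0.20, 0.05]) > 0.10
```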


Project Structure

replicalab-ai/
├── README.md
├── ReplicaLab_Architecture_Final.svg
├── pyproject.toml
├── openenv.yaml
├── replicalab/
│   ├── __init__.py
│   ├── models.py                # Action, Observation, State schemas
│   ├── client.py                # OpenEnv client wrapper
│   ├── oracle.py                # Optional frontier-model Oracle wrapper
│   ├── oracle_models.py         # Oracle scenario and post-mortem schemas
│   ├── cache.py                 # Cached Oracle scenario generation
│   ├── prompts/
│   │   ├── scientist.txt
│   │   ├── lab_manager.txt
│   │   ├── judge.txt
│   │   ├── oracle_world_architect.txt
│   │   ├── oracle_adjudicator.txt
│   │   ├── oracle_event_injector.txt
│   │   ├── oracle_post_mortem.txt
│   │   └── oracle_lab_manager.txt
│   ├── scenarios/
│   │   ├── templates.py         # Normalized scenario pack + Oracle adapter
│   │   ├── math_reasoning.py
│   │   ├── ml_benchmark.py
│   │   └── finance_trading.py
│   ├── scoring/
│   │   ├── rubric.py            # Canonical deterministic reward math
│   │   ├── rigor.py
│   │   ├── feasibility.py
│   │   ├── fidelity.py
│   │   └── explain.py
│   ├── agents/
│   │   ├── scientist_policy.py
│   │   ├── lab_manager_policy.py
│   │   ├── lab_manager_agent.py # Optional model-backed Lab Manager wrapper
│   │   └── judge_policy.py
│   ├── env/
│   │   └── replicalab_env.py    # Real env with optional Oracle hooks
│   ├── training/
│   │   ├── artifacts.py
│   │   ├── cli.py
│   │   ├── corpus.py
│   │   ├── datasets.py
│   │   ├── evaluation.py
│   │   ├── lab_manager_sft.py
│   │   ├── metrics.py
│   │   ├── plots.py
│   │   ├── rollout.py
│   │   ├── runtime.py
│   │   └── scientist_grpo.py
│   └── utils/
│       ├── seed.py
│       ├── validation.py
│       └── logging.py
├── server/
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── package.json
│   ├── vite.config.ts
│   ├── index.html
│   └── src/
│       ├── App.tsx              # Routes, Toast provider, Onboarding
│       ├── pages/               # DashboardPage, EpisodePage, ComparePage
│       ├── components/          # UI panels, 3D scenes, editor, toasts
│       ├── lib/                 # api.ts, audio.ts, confetti.ts, useTheme.ts
│       └── types/               # TypeScript contracts aligned with backend
├── notebooks/
│   ├── train_minimal_colab.ipynb
│   └── train_colab.ipynb
└── tests/
    ├── test_env.py
    ├── test_reward.py
    ├── test_scenarios.py
    ├── test_oracle.py
    ├── test_cache.py
    └── test_server.py

Deployment

Live deployment: https://ayushozha-replicalab.hf.space

The app is deployed on HF Spaces with sdk: docker on port 7860. The multi-stage Dockerfile builds the React frontend with Node.js, then serves both the UI and API from a single Python container.

curl https://ayushozha-replicalab.hf.space/health
# -> {"status":"ok","env":"real","version":"0.1.0"}
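
The same check can be scripted from Python with the standard library. This is a sketch: `parse_health` and `check_health` are hypothetical helpers, and the only field relied on is `status`, as shown in the sample response above.

```python
import json
from urllib.request import urlopen


def parse_health(body: str) -> bool:
    """Return True when the health payload reports status "ok"."""
    payload = json.loads(body)
    return payload.get("status") == "ok"


def check_health(base_url: str, timeout: float = 10.0) -> bool:
    """Fetch <base_url>/health and interpret the JSON response."""
    url = f"{base_url.rstrip('/')}/health"
    with urlopen(url, timeout=timeout) as response:
        return parse_health(response.read().decode("utf-8"))
```

`check_health("https://ayushozha-replicalab.hf.space")` performs the network call; `parse_health` can be exercised offline against a saved response.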

The fallback demo path at /web is always available, even when the React frontend is not built.


Toolchain

| Tool | Purpose |
| --- | --- |
| OpenEnv 0.2.1 | Environment class and server |
| FastAPI + WebSocket | Live environment serving |
| TRL / Unsloth | RL training (GRPO) |
| React + Vite | Frontend |
| Tailwind + shadcn/ui | Styling |
| Docker | Packaging |
| Hugging Face Spaces | Public hosting |
| Notebook / Colab / Northflank H100 | Training and evaluation |

Results

What Improved After Training

  • Higher reward: The trained Scientist achieves 67% higher average reward (4.25 -> 7.10) by learning to preserve rigor while respecting constraints.
  • Faster agreement: Negotiations converge in 2.8 rounds on average vs. 4.1 for the baseline -- the trained agent asks targeted questions instead of over-proposing.
  • Fewer invalid actions: Invalid action rate drops from 15% to 4% as the agent learns the structured action schema.

Evaluation Summary

| Metric | Baseline Scientist | Trained Scientist | Change (relative) |
| --- | --- | --- | --- |
| Average reward | 4.25 | 7.10 | +67% |
| Rounds to agreement | 4.1 | 2.8 | -32% |
| Invalid action rate | 15% | 4% | -73% |
| Agreement rate | 50% | 80% | +60% |
| Avg rigor score | 0.55 | 0.72 | +31% |
| Avg feasibility score | 0.52 | 0.78 | +50% |
| Avg fidelity score | 0.58 | 0.71 | +22% |
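
The Change column is a relative change against the baseline, so rates such as agreement (50% to 80%) read as +60% rather than +30 percentage points. A short check of that arithmetic, using only the figures from the table (the function name is illustrative):

```python
def relative_change_pct(baseline: float, trained: float) -> int:
    """Relative change from baseline to trained, rounded to a whole percent."""
    return round((trained - baseline) / baseline * 100)


# (baseline, trained) pairs copied from the evaluation table.
table = {
    "Average reward": (4.25, 7.10),
    "Rounds to agreement": (4.1, 2.8),
    "Invalid action rate": (15, 4),
    "Agreement rate": (50, 80),
    "Avg rigor score": (0.55, 0.72),
    "Avg feasibility score": (0.52, 0.78),
    "Avg fidelity score": (0.58, 0.71),
}

changes = {name: relative_change_pct(b, t) for name, (b, t) in table.items()}
```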

Key Takeaways for Judges

  1. The multiplicative reward formula means every dimension matters -- a plan that is rigorous but infeasible scores near zero.
  2. RL training teaches the Scientist to negotiate rather than just propose -- agreement rate jumps from 50% to 80%.
  3. The entire judge pipeline is deterministic: same seed, same actions, same score. No LLM-as-judge variance.
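
The determinism claim in the third point lends itself to a simple replay check. The sketch below uses a toy judge, not the real scoring pipeline; `toy_judge`, `noisy_judge`, and `replay_is_deterministic` are hypothetical names, and the noisy variant only stands in for a scorer with run-to-run variance.

```python
import random


def toy_judge(seed: int, actions: list) -> float:
    """Score depends only on the seed and the action sequence."""
    rng = random.Random(seed)  # local RNG: no shared global state
    budget = rng.choice([8, 9, 10])
    final_cost = actions[-1]
    return 1.0 if final_cost <= budget else 0.0


def noisy_judge(seed: int, actions: list) -> float:
    """Contrast case: unseeded noise makes repeated scores differ."""
    return toy_judge(seed, actions) + random.random()


def replay_is_deterministic(judge, seed: int, actions: list, repeats: int = 5) -> bool:
    """Replay the same episode several times and compare the scores."""
    scores = {judge(seed, list(actions)) for _ in range(repeats)}
    return len(scores) == 1
```

A check of this shape can run as a regression test: the deterministic judge passes for any seed and action list, while a scorer with hidden randomness fails it.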

Hackathon Track Alignment

| Track | Fit |
| --- | --- |
| Multi-Agent Interactions | Two roles with private information negotiate toward consensus |
| World Modeling (Professional) | Agent reasons inside a professional world with hidden constraints |
| Long-Horizon Planning | Multi-round ask-revise-recover-converge cycle |
| Self-Improvement | Scientist measurably improves over repeated episodes |

License

MIT