replicalab / docs /pitch_outline.md

maxxie114

Initial HF Spaces deployment

80d8c84 about 1 month ago


Three-Minute Pitch + Two-Minute Q&A Outline (DOC 10)

Pitch Structure (3 minutes)

1. The Problem (30 seconds)

"In a 2016 Nature survey, more than 70% of researchers said they had tried and failed to reproduce another scientist's experiments. The gap isn't bad science -- it's that real-world constraints force compromises that nobody planned for. Budgets shrink, equipment breaks, timelines slip. The protocol that worked in Theory A fails under Constraint B."

2. Our Solution (30 seconds)

"ReplicaLab is an OpenEnv environment where an AI Scientist learns to negotiate realistic replication plans. A Lab Manager enforces real constraints -- GPU budgets, scheduling conflicts, equipment limits. A deterministic Judge scores every plan. Through RL, the Scientist gets measurably better at navigating tradeoffs."
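
For presenters who want a concrete picture of the propose-and-check loop, here is a minimal sketch. Every name in it (`Constraints`, `Plan`, `check_feasibility`, `negotiate`) and the two constraint fields are hypothetical illustrations, not ReplicaLab's actual API. The scripted proposals stand in for the Scientist's model output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraints:
    """Hypothetical limits a Lab Manager might enforce."""
    gpu_hours: float
    max_days: int


@dataclass(frozen=True)
class Plan:
    """Hypothetical replication plan proposed by the Scientist."""
    gpu_hours: float
    days: int


def check_feasibility(plan: Plan, limits: Constraints) -> list[str]:
    """Return the names of violated constraints; an empty list means feasible."""
    violations = []
    if plan.gpu_hours > limits.gpu_hours:
        violations.append("gpu_hours")
    if plan.days > limits.max_days:
        violations.append("days")
    return violations


def negotiate(proposals: list[Plan], limits: Constraints, max_rounds: int = 3):
    """Try scripted proposals in order; stop at the first feasible one."""
    for round_no, plan in enumerate(proposals[:max_rounds], start=1):
        if not check_feasibility(plan, limits):
            return round_no, plan
    return None, None  # no agreement within the round limit


limits = Constraints(gpu_hours=100, max_days=14)
proposals = [Plan(gpu_hours=150, days=10), Plan(gpu_hours=90, days=12)]
rounds_used, agreed = negotiate(proposals, limits)
```

In this toy run the first proposal exceeds the GPU budget and the second is accepted, so the episode ends in agreement after two rounds.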

3. Live Demo (60 seconds)

Show HF Space or local frontend
Start an ML Benchmark episode (seed 42, medium difficulty)
Point out the Scientist's proposal and Lab Manager's feasibility report
Show the Judge scoring: rigor, feasibility, fidelity breakdown
Toggle to training results: before vs after comparison

4. Technical Architecture (30 seconds)

"Three scenario families -- math, ML, finance -- each with deterministic seed-based generation. The reward formula is multiplicative: 10 x rigor x feasibility x fidelity. Every dimension must score well. The entire judge is deterministic -- same seed, same actions, same score. No LLM-as-judge variance."
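
The multiplicative formula can be sketched in a few lines. This assumes each sub-score lies in [0, 1], which the outline does not state explicitly; the function name `reward` and the range check are illustrative, not the project's code.

```python
def reward(rigor: float, feasibility: float, fidelity: float, scale: float = 10.0) -> float:
    """Multiplicative reward: scale * rigor * feasibility * fidelity.

    Assumption: every sub-score is normalized to [0, 1].
    """
    for name, value in (("rigor", rigor), ("feasibility", feasibility), ("fidelity", fidelity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return scale * rigor * feasibility * fidelity


balanced = reward(0.9, 0.9, 0.9)   # every dimension is strong
lopsided = reward(1.0, 1.0, 0.0)   # one dimension fails completely
```

Because the terms are multiplied rather than averaged, a plan that scores zero on any one dimension earns zero overall, which is the "every dimension must score well" property in the pitch.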

5. Results (20 seconds)

"After RL training: 67% higher reward, 32% fewer negotiation rounds, invalid actions drop from 15% to 4%, agreement rate jumps from 50% to 80%."
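
If a judge asks how the before/after rates compare, it helps to separate percentage-point change from relative change. The helper below is a small sketch that uses only the two rate pairs quoted above. Whether the 67% and 32% figures are relative changes is an assumption, since the outline does not give their underlying values.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction of `before`."""
    if before == 0:
        raise ValueError("relative change is undefined when before == 0")
    return (after - before) / before


def point_change(before: float, after: float) -> float:
    """Absolute change in percentage points for rates given as fractions."""
    return (after - before) * 100


# Rates quoted in the pitch, written as fractions.
invalid_rel = relative_change(0.15, 0.04)    # roughly a 73% relative drop
agreement_rel = relative_change(0.50, 0.80)  # a 60% relative increase
agreement_pts = point_change(0.50, 0.80)     # a 30-point increase
```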

6. Close (10 seconds)

"ReplicaLab. An OpenEnv world where agents learn to negotiate science."

Anticipated Q&A Topics

| Question | Talking Points |
| --- | --- |
| Why deterministic scoring? | Noisy rewards make RL unstable. Deterministic judge = reproducible training. Optional Oracle layer adds richness without corrupting the reward signal. |
| How does difficulty scaling work? | Mechanical constraint tightening: budgets shrink, resources go out of stock, scheduling conflicts appear. Same outer contract at every difficulty. |
| What model do you train? | Qwen3-8B with GRPO via Unsloth/TRL. 4B fallback for faster iteration. |
| How many scenarios? | 3 domain families x 3 difficulties x infinite seeds. Each seed produces a unique but deterministic scenario. |
| Why not LLM-as-judge? | Variance. Two runs of the same episode would get different scores. We need a stable reward signal for RL. The optional Oracle post-mortem adds natural language analysis without replacing the score. |
| What's the Lab Manager? | Hybrid: deterministic feasibility checker (ground truth) + optional model narration. Checker output is always the source of truth. |
| Fallback if UI breaks? | `/web` endpoint serves a self-contained HTML interface with no build step. |
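
To back up the answers on seeds and difficulty scaling, here is a hedged sketch of deterministic, seed-based scenario generation with mechanical tightening. The family names come from the pitch; the `Scenario` fields, the value ranges, and the `DIFFICULTY_SCALE` factors are all invented for illustration and do not reflect ReplicaLab's real generator.

```python
import random
from dataclasses import dataclass

FAMILIES = ("math", "ml", "finance")
# Hypothetical tightening factors: smaller means tighter constraints.
DIFFICULTY_SCALE = {"easy": 1.0, "medium": 0.7, "hard": 0.4}


@dataclass(frozen=True)
class Scenario:
    """Same fields at every difficulty (the 'outer contract')."""
    family: str
    difficulty: str
    seed: int
    budget: float
    available_gpus: int


def make_scenario(family: str, difficulty: str, seed: int) -> Scenario:
    if family not in FAMILIES:
        raise ValueError(f"unknown family: {family}")
    scale = DIFFICULTY_SCALE[difficulty]
    # A private RNG seeded from a string is reproducible across processes
    # and never touches the global random state.
    rng = random.Random(f"{family}:{seed}")
    base_budget = rng.uniform(1000.0, 5000.0)
    base_gpus = rng.randint(4, 16)
    # Difficulty only tightens the base draw; it does not change the draw itself.
    return Scenario(
        family=family,
        difficulty=difficulty,
        seed=seed,
        budget=round(base_budget * scale, 2),
        available_gpus=max(1, int(base_gpus * scale)),
    )


first = make_scenario("ml", "medium", 42)
second = make_scenario("ml", "medium", 42)
easy = make_scenario("ml", "easy", 42)
hard = make_scenario("ml", "hard", 42)
```

Seeding from the family and seed alone, then scaling by difficulty, means a harder setting is a strictly tighter version of the same underlying scenario rather than a different one.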