
Scenarios Map — replicalab/scenarios/

Normalized scenario generation across three domains (mathematics, machine learning, finance trading) with seeded determinism.

Tasks implemented: SCN 01-12

Entry Point

generate_scenario(seed, template, difficulty) -> NormalizedScenarioPack

Located in templates.py. The main public API.

Flow:

  1. seed_rng(seed) → deterministic random.Random instance
  2. load_template(template) → picks the template builder function
  3. builder(rng) → raw draft dict (randomly selects one of two cases per domain)
  4. apply_difficulty(draft, difficulty, rng) → scales budget, time, staff, resources
  5. _build_pack(seed, template, draft) → constructs the NormalizedScenarioPack
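The five steps above can be sketched end to end with a stdlib-only toy. Everything here is an illustrative stand-in (the template name, the builder, and the simplified pack are hypothetical); only the budget multipliers come from the difficulty table in this document:

```python
import random
from typing import Any, Callable

def seed_rng(seed: int) -> random.Random:
    # Deterministic RNG; the real seed_rng derives a child seed via SHA-256.
    return random.Random(seed)

def _toy_builder(rng: random.Random) -> dict[str, Any]:
    # Stand-in for a real template builder: picks one of two cases.
    case = rng.choice(["case_a", "case_b"])
    return {"case": case, "budget_total": 100.0, "time_limit_days": 5}

# Hypothetical registry; the real code dispatches via load_template().
TEMPLATE_BUILDERS: dict[str, Callable[[random.Random], dict[str, Any]]] = {
    "toy_template": _toy_builder,
}

def apply_difficulty(draft: dict[str, Any], difficulty: str,
                     rng: random.Random) -> dict[str, Any]:
    # Budget multipliers from the difficulty table; other scaling omitted.
    scale = {"easy": 1.15, "medium": 0.95, "hard": 0.80}[difficulty]
    draft["budget_total"] *= scale
    return draft

def generate_scenario_sketch(seed: int, template: str,
                             difficulty: str) -> dict[str, Any]:
    rng = seed_rng(seed)                               # step 1
    builder = TEMPLATE_BUILDERS[template]              # step 2
    draft = builder(rng)                               # step 3
    draft = apply_difficulty(draft, difficulty, rng)   # step 4
    draft["scenario_id"] = f"{template}_{seed}"        # step 5 (simplified)
    return draft
```

Determinism falls out of step 1: the same seed always yields the same case selection and the same scaled draft.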

available_scenario_families() -> list[dict]

Returns [{"family": name, "difficulties": ["easy", "medium", "hard"]}] for each template.
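Given the three TemplateName values, a minimal sketch of this function is:

```python
def available_scenario_families() -> list[dict]:
    # Illustrative re-implementation: one entry per template, each offering
    # the same three difficulty levels.
    templates = ("math_reasoning", "ml_benchmark", "finance_trading")
    return [
        {"family": name, "difficulties": ["easy", "medium", "hard"]}
        for name in templates
    ]
```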

Core Data Classes (all in templates.py)

NormalizedScenarioPack(BaseModel) — extra="forbid"

The complete scenario definition. Every downstream consumer uses this.

| Field | Type | Source |
|---|---|---|
| scenario_id | str | "{template}_{seed}" |
| template | TemplateName | input param |
| domain_id | str | from template case |
| difficulty | Difficulty | input param |
| seed | int | input param |
| task_summary | str | from template case |
| success_criteria | list[str] | from template case |
| constraints | list[ScenarioConstraint] | from template + difficulty scaling |
| resources | list[ScenarioResource] | from template + difficulty scaling |
| allowed_substitutions | list[AllowedSubstitution] | from template case |
| hidden_reference_spec | HiddenReferenceSpec | from template case |
| scientist_observation | ScientistObservation | built from case fields |
| lab_manager_observation | LabManagerObservation | built from case fields |

ScenarioConstraint(BaseModel)

| Field | Type | Example |
|---|---|---|
| key | str | "gpu_hours" |
| label | str | "Maximum GPU budget" |
| quantity | float \| int \| None | 8 |
| unit | str \| None | "gpu_hours" |
| comparator | Literal["<=", ">=", "="] | "<=" |
| hard | bool | True |
| details | str | "The full run must fit within eight GPU-hours." |
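The example column above can be written out as an instance. A stdlib dataclass stands in for the real Pydantic BaseModel here; the field names and values are taken from the table:

```python
from dataclasses import dataclass
from typing import Literal, Optional, Union

@dataclass
class ScenarioConstraint:
    # Dataclass sketch; the real model is a Pydantic BaseModel.
    key: str
    label: str
    quantity: Union[float, int, None]
    unit: Optional[str]
    comparator: Literal["<=", ">=", "="]
    hard: bool
    details: str

gpu_budget = ScenarioConstraint(
    key="gpu_hours",
    label="Maximum GPU budget",
    quantity=8,
    unit="gpu_hours",
    comparator="<=",
    hard=True,
    details="The full run must fit within eight GPU-hours.",
)
```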

ScenarioResource(BaseModel)

| Field | Type | Example |
|---|---|---|
| key | str | "gpu_node" |
| label | str | "A100 GPU node" |
| quantity | float \| int \| None | 1 |
| unit | str \| None | "node" |
| available | bool | True |
| category | str | "compute" |
| details | str | "Reserved for one benchmark run at a time." |

AllowedSubstitution(BaseModel)

| Field | Type | Example |
|---|---|---|
| original | str | "A100 GPU node" |
| alternative | str | "V100 GPU node" |
| condition | str | "Use if A100 is booked." |
| tradeoff | str | "V100 is slower; extend training by ~30%." |

HiddenReferenceSpec(BaseModel)

Ground truth the judge uses to score fidelity. The scientist never sees this.

| Field | Type | Example |
|---|---|---|
| summary | str | "A valid plan keeps the published split..." |
| required_elements | list[str] | ["published data split", "held-out accuracy evaluation"] |
| flexible_elements | list[str] | ["batch size", "learning-rate schedule"] |
| target_metric | str | "held_out_accuracy" |
| target_value | str | "within one point of the reported baseline" |
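Since required_elements is effectively the judge's checklist, the simplest possible fidelity check is substring matching. This is a purely hypothetical sketch (scoring is future work), using the example values from the table:

```python
def covers_required_elements(plan_text: str, required_elements: list[str]) -> bool:
    # Hypothetical fidelity check: every required element must appear
    # (case-insensitively) somewhere in the scientist's plan.
    text = plan_text.lower()
    return all(element.lower() in text for element in required_elements)
```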

Template Builders

Each returns a raw dict[str, Any] with one randomly selected case.
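The builder contract can be sketched as follows. The domain and case contents are placeholders, not the real templates; the field names follow the pack table above:

```python
import random
from typing import Any

def build_example_template(rng: random.Random) -> dict[str, Any]:
    # Illustrative builder: each real builder returns a raw draft dict
    # for one randomly selected case.
    cases = [
        {
            "domain_id": "example_domain",
            "task_summary": "Replicate result A.",
            "success_criteria": ["criterion A1", "criterion A2"],
        },
        {
            "domain_id": "example_domain",
            "task_summary": "Replicate result B.",
            "success_criteria": ["criterion B1"],
        },
    ]
    return rng.choice(cases)
```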

build_math_reasoning_template(rng) — math_reasoning.py

  - Domain: mathematics
  - Case A: Cauchy-Schwarz inequality — structured proof verification
  - Case B: Jensen's inequality — convexity-based proof
  - Equipment: Structured proof notebook, Automated proof checker
  - Reagents: Graduate reviewer, Reference textbook
  - Substitutions: Graduate reviewer → self-check rubric

build_ml_benchmark_template(rng) — ml_benchmark.py

  - Domain: machine_learning
  - Case A: AG News TinyBERT — text classification replication
  - Case B: CIFAR-10 ResNet-18 — image classification replication
  - Equipment: A100 GPU node, Dataset mirror, Experiment tracker
  - Reagents: Pre-trained checkpoint, Evaluation harness
  - Substitutions: A100 → V100 (slower), full dataset → stratified sample

build_finance_trading_template(rng) — finance_trading.py

  - Domain: finance_trading
  - Case A: SPY/QQQ mean-reversion — pairs trading backtest
  - Case B: Momentum futures — trend-following strategy
  - Equipment: Backtest engine, Historical daily bar dataset
  - Reagents: Risk reviewer, Compliance packet
  - Substitutions: Daily bars → weekly bars, risk reviewer → automated risk check
  - Safety restrictions: offline-only execution policy

Difficulty Scaling — apply_difficulty(draft, difficulty, rng)

| Parameter | Easy | Medium | Hard |
|---|---|---|---|
| budget_total | ×1.15 | ×0.95 | ×0.80 |
| time_limit_days | unchanged | −1 day | −1 day |
| staff_count | unchanged | unchanged | −1 person |
| Resources tightened | 0 | 1 | 2 |
| Conflict constraint | no | yes (1) | yes (1) |

  - _tighten_one_resource: picks a random resource and sets available=False.
  - _append_conflict_constraint: adds a soft constraint noting the resource conflict.

Utility — replicalab/utils/seed.py

| Function | Purpose |
|---|---|
| get_deterministic_seed(seed, namespace) | SHA-256-based child seed derivation |
| seed_rng(seed, namespace) | Returns random.Random(derived_seed) |
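A plausible shape for the two utilities. The exact hash-to-int scheme is an assumption; only the SHA-256 basis and the signatures come from this document:

```python
import hashlib
import random

def get_deterministic_seed(seed: int, namespace: str = "") -> int:
    # Assumed derivation scheme: hash "seed:namespace" with SHA-256 and
    # take the first 8 bytes as an integer child seed.
    digest = hashlib.sha256(f"{seed}:{namespace}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def seed_rng(seed: int, namespace: str = "") -> random.Random:
    # Same (seed, namespace) pair always yields the same RNG stream.
    return random.Random(get_deterministic_seed(seed, namespace))
```

Namespacing lets independent components (e.g. different builders) draw from uncorrelated streams derived from one top-level seed.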

Type Aliases

```python
Difficulty = Literal["easy", "medium", "hard"]
TemplateName = Literal["math_reasoning", "ml_benchmark", "finance_trading"]
TemplateBuilder = Callable[[Any], dict[str, Any]]
```

Constants

```python
GOLDEN_SCENARIO_SPECS_PATH = Path("tests/fixtures/golden_scenarios.json")
```

Who Consumes This

  • validation.py β€” reads constraints, resources, substitutions, hidden_reference_spec
  • lab_manager_policy.py β€” reads lab_manager_observation, substitutions, constraints
  • scientist_policy.py β€” reads scenario pack for system prompt generation
  • server/app.py β€” calls generate_scenario() on reset, stores pack for lab manager
  • scoring/ (future) β€” will read hidden_reference_spec for fidelity scoring