# Scenarios Map → `replicalab/scenarios/`
Normalized scenario generation across 3 domains with seeded determinism.
Tasks implemented: SCN 01-12
## Entry Point
### `generate_scenario(seed, template, difficulty) -> NormalizedScenarioPack`
Located in `templates.py`. The main public API.
Flow:
1. `seed_rng(seed)` → deterministic `random.Random` instance
2. `load_template(template)` → picks the template builder function
3. `builder(rng)` → raw draft dict (randomly selects one of 2 cases per domain)
4. `apply_difficulty(draft, difficulty, rng)` → scales budget, time, staff, resources
5. `_build_pack(seed, template, draft)` → constructs `NormalizedScenarioPack`
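The five steps above can be sketched end to end. Everything here is a simplified stand-in (`toy_builder`, the dict-based pack, and the hashing scheme are assumptions for illustration); only the step order and function names come from the flow.

```python
import hashlib
import random
from typing import Any, Callable

def seed_rng(seed: int, namespace: str = "scenario") -> random.Random:
    # Hypothetical stand-in for replicalab/utils/seed.py: SHA-256 child seed.
    digest = hashlib.sha256(f"{namespace}:{seed}".encode()).hexdigest()
    return random.Random(int(digest, 16))

def toy_builder(rng: random.Random) -> dict[str, Any]:
    # Stands in for a real template builder: pick one of 2 cases per domain.
    return {"case": rng.choice(["case_a", "case_b"]), "budget_total": 100.0}

TEMPLATES: dict[str, Callable[[random.Random], dict[str, Any]]] = {
    "ml_benchmark": toy_builder,
}

def apply_difficulty(draft: dict[str, Any], difficulty: str,
                     rng: random.Random) -> dict[str, Any]:
    # Only the budget multiplier from the Difficulty Scaling table is shown.
    draft["budget_total"] *= {"easy": 1.15, "medium": 0.95, "hard": 0.80}[difficulty]
    return draft

def generate_scenario(seed: int, template: str, difficulty: str) -> dict[str, Any]:
    rng = seed_rng(seed)
    draft = TEMPLATES[template](rng)
    draft = apply_difficulty(draft, difficulty, rng)
    draft["scenario_id"] = f"{template}_{seed}"  # pack construction, simplified
    return draft
```

Because every random choice flows from the seeded RNG, calling `generate_scenario` twice with the same arguments yields identical output.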
### `available_scenario_families() -> list[dict]`
Returns `[{"family": name, "difficulties": ["easy", "medium", "hard"]}]`, one entry per template.
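A minimal sketch of that enumeration, assuming the family names simply mirror the `TemplateName` alias listed later in this document:

```python
TEMPLATE_NAMES = ["math_reasoning", "ml_benchmark", "finance_trading"]
DIFFICULTIES = ["easy", "medium", "hard"]

def available_scenario_families() -> list[dict]:
    # One entry per template; every family offers all three difficulties.
    return [{"family": name, "difficulties": list(DIFFICULTIES)}
            for name in TEMPLATE_NAMES]
```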
## Core Data Classes (all in `templates.py`)
### `NormalizedScenarioPack(BaseModel)` (`extra="forbid"`)
The complete scenario definition. Every downstream consumer uses this.
| Field | Type | Source |
|---|---|---|
| `scenario_id` | `str` | `"{template}_{seed}"` |
| `template` | `TemplateName` | input param |
| `domain_id` | `str` | from template case |
| `difficulty` | `Difficulty` | input param |
| `seed` | `int` | input param |
| `task_summary` | `str` | from template case |
| `success_criteria` | `list[str]` | from template case |
| `constraints` | `list[ScenarioConstraint]` | from template + difficulty scaling |
| `resources` | `list[ScenarioResource]` | from template + difficulty scaling |
| `allowed_substitutions` | `list[AllowedSubstitution]` | from template case |
| `hidden_reference_spec` | `HiddenReferenceSpec` | from template case |
| `scientist_observation` | `ScientistObservation` | built from case fields |
| `lab_manager_observation` | `LabManagerObservation` | built from case fields |
### `ScenarioConstraint(BaseModel)`
| Field | Type | Example |
|---|---|---|
| `key` | `str` | `"gpu_hours"` |
| `label` | `str` | `"Maximum GPU budget"` |
| `quantity` | `float \| int \| None` | `8` |
| `unit` | `str \| None` | `"gpu_hours"` |
| `comparator` | `Literal["<=", ">=", "="]` | `"<="` |
| `hard` | `bool` | `True` |
| `details` | `str` | `"The full run must fit within eight GPU-hours."` |
### `ScenarioResource(BaseModel)`
| Field | Type | Example |
|---|---|---|
| `key` | `str` | `"gpu_node"` |
| `label` | `str` | `"A100 GPU node"` |
| `quantity` | `float \| int \| None` | `1` |
| `unit` | `str \| None` | `"node"` |
| `available` | `bool` | `True` |
| `category` | `str` | `"compute"` |
| `details` | `str` | `"Reserved for one benchmark run at a time."` |
### `AllowedSubstitution(BaseModel)`
| Field | Type | Example |
|---|---|---|
| `original` | `str` | `"A100 GPU node"` |
| `alternative` | `str` | `"V100 GPU node"` |
| `condition` | `str` | `"Use if A100 is booked."` |
| `tradeoff` | `str` | `"V100 is slower; extend training by ~30%."` |
### `HiddenReferenceSpec(BaseModel)`
Ground truth the judge uses to score fidelity. The scientist never sees this.
| Field | Type | Example |
|---|---|---|
| `summary` | `str` | `"A valid plan keeps the published split..."` |
| `required_elements` | `list[str]` | `["published data split", "held-out accuracy evaluation"]` |
| `flexible_elements` | `list[str]` | `["batch size", "learning-rate schedule"]` |
| `target_metric` | `str` | `"held_out_accuracy"` |
| `target_value` | `str` | `"within one point of the reported baseline"` |
## Template Builders
Each builder returns a raw `dict[str, Any]` with one randomly selected case.
### `build_math_reasoning_template(rng)` → `math_reasoning.py`
- Domain: `mathematics`
- Case A: Cauchy-Schwarz inequality → structured proof verification
- Case B: Jensen's inequality → convexity-based proof
- Equipment: Structured proof notebook, Automated proof checker
- Reagents: Graduate reviewer, Reference textbook
- Substitutions: Graduate reviewer → self-check rubric
### `build_ml_benchmark_template(rng)` → `ml_benchmark.py`
- Domain: `machine_learning`
- Case A: AG News TinyBERT → text classification replication
- Case B: CIFAR-10 ResNet-18 → image classification replication
- Equipment: A100 GPU node, Dataset mirror, Experiment tracker
- Reagents: Pre-trained checkpoint, Evaluation harness
- Substitutions: A100 → V100 (slower), full dataset → stratified sample
### `build_finance_trading_template(rng)` → `finance_trading.py`
- Domain: `finance_trading`
- Case A: SPY/QQQ mean-reversion → pairs trading backtest
- Case B: Momentum futures → trend-following strategy
- Equipment: Backtest engine, Historical daily bar dataset
- Reagents: Risk reviewer, Compliance packet
- Substitutions: Daily bars → weekly bars, risk reviewer → automated risk check
- Safety restrictions: offline-only execution policy
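All three builders follow the same pattern: select one of the template's two cases with the seeded RNG, then return a raw draft dict. A toy sketch (the case summaries paraphrase the bullets above; the budget, time, and staff values are invented placeholders):

```python
import random
from typing import Any

def build_ml_benchmark_template(rng: random.Random) -> dict[str, Any]:
    # Randomly select one of the template's two cases.
    cases = [
        {"domain_id": "machine_learning",
         "task_summary": "Replicate AG News TinyBERT text classification."},
        {"domain_id": "machine_learning",
         "task_summary": "Replicate CIFAR-10 ResNet-18 image classification."},
    ]
    case = rng.choice(cases)
    # Placeholder values; the real draft carries constraints, resources, etc.
    return {**case, "budget_total": 100.0, "time_limit_days": 5, "staff_count": 2}
```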
## Difficulty Scaling → `apply_difficulty(draft, difficulty, rng)`
| Parameter | Easy | Medium | Hard |
|---|---|---|---|
| `budget_total` | ×1.15 | ×0.95 | ×0.80 |
| `time_limit_days` | unchanged | −1 day | −1 day |
| `staff_count` | unchanged | unchanged | −1 person |
| Resources tightened | 0 | 1 | 2 |
| Conflict constraint | no | yes (1) | yes (1) |
- `_tighten_one_resource`: picks a random resource and sets `available=False`.
- `_append_conflict_constraint`: adds a soft constraint noting the resource conflict.
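The table and the two helpers can be sketched as follows. The multipliers and deltas come from the table; the dict-based draft shape is an assumption for illustration.

```python
import random
from typing import Any

BUDGET_SCALE = {"easy": 1.15, "medium": 0.95, "hard": 0.80}
TIME_DELTA = {"easy": 0, "medium": -1, "hard": -1}
STAFF_DELTA = {"easy": 0, "medium": 0, "hard": -1}
TIGHTEN_COUNT = {"easy": 0, "medium": 1, "hard": 2}

def _tighten_one_resource(draft: dict[str, Any], rng: random.Random) -> None:
    # Pick a random still-available resource and mark it unavailable.
    candidates = [r for r in draft["resources"] if r["available"]]
    if candidates:
        rng.choice(candidates)["available"] = False

def _append_conflict_constraint(draft: dict[str, Any]) -> None:
    # Soft (hard=False) constraint noting a resource conflict.
    draft["constraints"].append({"key": "resource_conflict", "hard": False})

def apply_difficulty(draft: dict[str, Any], difficulty: str,
                     rng: random.Random) -> dict[str, Any]:
    draft["budget_total"] *= BUDGET_SCALE[difficulty]
    draft["time_limit_days"] += TIME_DELTA[difficulty]
    draft["staff_count"] += STAFF_DELTA[difficulty]
    for _ in range(TIGHTEN_COUNT[difficulty]):
        _tighten_one_resource(draft, rng)
    if difficulty in ("medium", "hard"):
        _append_conflict_constraint(draft)
    return draft
```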
## Utility → `replicalab/utils/seed.py`
| Function | Purpose |
|---|---|
| `get_deterministic_seed(seed, namespace)` | SHA-256-based child-seed derivation |
| `seed_rng(seed, namespace)` | Returns `random.Random(derived_seed)` |
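A plausible sketch of the SHA-256-based derivation. Only the function names and intent come from the table; the exact hashing scheme (how seed and namespace are combined, how many digest bytes are used) is an assumption.

```python
import hashlib
import random

def get_deterministic_seed(seed: int, namespace: str) -> int:
    # Hash "namespace:seed" with SHA-256 and fold the digest into an int.
    digest = hashlib.sha256(f"{namespace}:{seed}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def seed_rng(seed: int, namespace: str = "default") -> random.Random:
    # The same (seed, namespace) pair always yields an identically seeded RNG,
    # while different namespaces give independent child streams.
    return random.Random(get_deterministic_seed(seed, namespace))
```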
## Type Aliases
- `Difficulty = Literal["easy", "medium", "hard"]`
- `TemplateName = Literal["math_reasoning", "ml_benchmark", "finance_trading"]`
- `TemplateBuilder = Callable[[Any], dict[str, Any]]`
## Constants
- `GOLDEN_SCENARIO_SPECS_PATH = Path("tests/fixtures/golden_scenarios.json")`
## Who Consumes This
- `validation.py` → reads constraints, resources, substitutions, `hidden_reference_spec`
- `lab_manager_policy.py` → reads `lab_manager_observation`, substitutions, constraints
- `scientist_policy.py` → reads the scenario pack for system-prompt generation
- `server/app.py` → calls `generate_scenario()` on reset and stores the pack for the lab manager
- `scoring/` (future) → will read `hidden_reference_spec` for fidelity scoring