# Scenarios Map — `replicalab/scenarios/`
> Normalized scenario generation across 3 domains with seeded determinism.
>
> **Tasks implemented:** SCN 01-12
## Entry Point
### `generate_scenario(seed, template, difficulty) -> NormalizedScenarioPack`
Defined in `templates.py`; this is the main public API.
**Flow:**
1. `seed_rng(seed)` → deterministic `random.Random` instance
2. `load_template(template)` → picks the template builder function
3. `builder(rng)` → raw draft dict (randomly selects one of 2 cases per domain)
4. `apply_difficulty(draft, difficulty, rng)` → scales budget, time, staff, resources
5. `_build_pack(seed, template, draft)` → constructs `NormalizedScenarioPack`
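The five steps above can be condensed into a minimal, self-contained sketch. Everything here is illustrative: the real builders, difficulty scaling, and pack construction live in `templates.py` and the per-domain modules, and `_demo_builder`/`TEMPLATES` are hypothetical stand-ins.

```python
import random
from typing import Any, Callable

def _demo_builder(rng: random.Random) -> dict[str, Any]:
    # Hypothetical builder: randomly selects one of 2 cases, like the real ones.
    case = rng.choice(["case_a", "case_b"])
    return {"case": case, "budget_total": 100.0}

# Hypothetical registry standing in for load_template().
TEMPLATES: dict[str, Callable[[random.Random], dict[str, Any]]] = {
    "demo_template": _demo_builder,
}

def generate_scenario(seed: int, template: str, difficulty: str) -> dict[str, Any]:
    rng = random.Random(seed)                    # step 1: seeded RNG
    builder = TEMPLATES[template]                # step 2: pick builder
    draft = builder(rng)                         # step 3: raw draft dict
    if difficulty == "hard":                     # step 4: difficulty scaling (sketch)
        draft["budget_total"] *= 0.80
    draft["scenario_id"] = f"{template}_{seed}"  # step 5: pack construction (sketch)
    return draft
```

Because all randomness flows through the seeded `Random` instance, calling the function twice with the same seed yields an identical scenario.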
### `available_scenario_families() -> list[dict]`
Returns `[{"family": name, "difficulties": ["easy", "medium", "hard"]}]` for each template.
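A sketch of the return shape, using the three template names from the `TemplateName` alias below (the real function presumably iterates a template registry rather than a hard-coded tuple):

```python
def available_scenario_families() -> list[dict]:
    # One entry per registered template; every family exposes all three tiers.
    return [
        {"family": name, "difficulties": ["easy", "medium", "hard"]}
        for name in ("math_reasoning", "ml_benchmark", "finance_trading")
    ]
```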
## Core Data Classes (all in `templates.py`)
### `NormalizedScenarioPack(BaseModel)` — `extra="forbid"`
The complete scenario definition. Every downstream consumer uses this.
| Field | Type | Source |
|-------|------|--------|
| `scenario_id` | `str` | `"{template}_{seed}"` |
| `template` | `TemplateName` | input param |
| `domain_id` | `str` | from template case |
| `difficulty` | `Difficulty` | input param |
| `seed` | `int` | input param |
| `task_summary` | `str` | from template case |
| `success_criteria` | `list[str]` | from template case |
| `constraints` | `list[ScenarioConstraint]` | from template + difficulty scaling |
| `resources` | `list[ScenarioResource]` | from template + difficulty scaling |
| `allowed_substitutions` | `list[AllowedSubstitution]` | from template case |
| `hidden_reference_spec` | `HiddenReferenceSpec` | from template case |
| `scientist_observation` | `ScientistObservation` | built from case fields |
| `lab_manager_observation` | `LabManagerObservation` | built from case fields |
### `ScenarioConstraint(BaseModel)`
| Field | Type | Example |
|-------|------|---------|
| `key` | `str` | `"gpu_hours"` |
| `label` | `str` | `"Maximum GPU budget"` |
| `quantity` | `float \| int \| None` | `8` |
| `unit` | `str \| None` | `"gpu_hours"` |
| `comparator` | `Literal["<=", ">=", "="]` | `"<="` |
| `hard` | `bool` | `True` |
| `details` | `str` | `"The full run must fit within eight GPU-hours."` |
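The `comparator` field lends itself to a simple check. A dependency-free sketch, using a dataclass in place of the real Pydantic `BaseModel`; the `satisfies` helper is hypothetical and not part of the module's API:

```python
from dataclasses import dataclass
from typing import Literal, Optional, Union

@dataclass
class ScenarioConstraint:
    # Mirrors the fields in the table above (dataclass used for a stdlib-only sketch).
    key: str
    label: str
    quantity: Union[float, int, None]
    unit: Optional[str]
    comparator: Literal["<=", ">=", "="]
    hard: bool
    details: str

def satisfies(constraint: ScenarioConstraint, observed: float) -> bool:
    """Hypothetical helper: check an observed value against the constraint's bound."""
    if constraint.quantity is None:
        return True  # purely descriptive constraint, nothing to compare
    return {"<=": observed <= constraint.quantity,
            ">=": observed >= constraint.quantity,
            "=": observed == constraint.quantity}[constraint.comparator]

gpu = ScenarioConstraint("gpu_hours", "Maximum GPU budget", 8, "gpu_hours",
                         "<=", True, "The full run must fit within eight GPU-hours.")
# satisfies(gpu, 6.5) -> True; satisfies(gpu, 9) -> False
```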
### `ScenarioResource(BaseModel)`
| Field | Type | Example |
|-------|------|---------|
| `key` | `str` | `"gpu_node"` |
| `label` | `str` | `"A100 GPU node"` |
| `quantity` | `float \| int \| None` | `1` |
| `unit` | `str \| None` | `"node"` |
| `available` | `bool` | `True` |
| `category` | `str` | `"compute"` |
| `details` | `str` | `"Reserved for one benchmark run at a time."` |
### `AllowedSubstitution(BaseModel)`
| Field | Type | Example |
|-------|------|---------|
| `original` | `str` | `"A100 GPU node"` |
| `alternative` | `str` | `"V100 GPU node"` |
| `condition` | `str` | `"Use if A100 is booked."` |
| `tradeoff` | `str` | `"V100 is slower; extend training by ~30%."` |
### `HiddenReferenceSpec(BaseModel)`
Ground truth the judge uses to score fidelity. The scientist never sees this.
| Field | Type | Example |
|-------|------|---------|
| `summary` | `str` | `"A valid plan keeps the published split..."` |
| `required_elements` | `list[str]` | `["published data split", "held-out accuracy evaluation"]` |
| `flexible_elements` | `list[str]` | `["batch size", "learning-rate schedule"]` |
| `target_metric` | `str` | `"held_out_accuracy"` |
| `target_value` | `str` | `"within one point of the reported baseline"` |
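Since the judge scores against `required_elements`, a naive fidelity check could look like the following. This helper is entirely hypothetical; the real scoring logic belongs to the planned `scoring/` package:

```python
def missing_required_elements(plan_text: str, required_elements: list[str]) -> list[str]:
    # Hypothetical check: a plan is faithful only if every required element
    # from the hidden reference spec is mentioned somewhere in the plan.
    lowered = plan_text.lower()
    return [el for el in required_elements if el.lower() not in lowered]

spec_required = ["published data split", "held-out accuracy evaluation"]
plan = "We keep the published data split and report a held-out accuracy evaluation."
# missing_required_elements(plan, spec_required) -> []
```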
## Template Builders
Each returns a raw `dict[str, Any]` with one randomly selected case.
### `build_math_reasoning_template(rng)` — `math_reasoning.py`
- **Domain:** `mathematics`
- **Case A:** Cauchy-Schwarz inequality — structured proof verification
- **Case B:** Jensen's inequality — convexity-based proof
- **Equipment:** Structured proof notebook, Automated proof checker
- **Reagents:** Graduate reviewer, Reference textbook
- **Substitutions:** Graduate reviewer → self-check rubric
### `build_ml_benchmark_template(rng)` — `ml_benchmark.py`
- **Domain:** `machine_learning`
- **Case A:** AG News TinyBERT — text classification replication
- **Case B:** CIFAR-10 ResNet-18 — image classification replication
- **Equipment:** A100 GPU node, Dataset mirror, Experiment tracker
- **Reagents:** Pre-trained checkpoint, Evaluation harness
- **Substitutions:** A100 → V100 (slower), full dataset → stratified sample
### `build_finance_trading_template(rng)` — `finance_trading.py`
- **Domain:** `finance_trading`
- **Case A:** SPY/QQQ mean-reversion — pairs trading backtest
- **Case B:** Momentum futures — trend-following strategy
- **Equipment:** Backtest engine, Historical daily bar dataset
- **Reagents:** Risk reviewer, Compliance packet
- **Substitutions:** Daily bars → weekly bars, risk reviewer → automated risk check
- **Safety restrictions:** offline-only execution policy
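All three builders follow the same pattern; a hypothetical skeleton (any field beyond those documented above is illustrative):

```python
import random
from typing import Any

def build_demo_template(rng: random.Random) -> dict[str, Any]:
    # Randomly select one of the template's two cases; the seeded rng
    # makes the choice reproducible for a given scenario seed.
    cases = [
        {"domain_id": "demo", "task_summary": "Case A"},
        {"domain_id": "demo", "task_summary": "Case B"},
    ]
    case = rng.choice(cases)
    return {**case, "budget_total": 100.0, "time_limit_days": 5, "staff_count": 2}
```

Two `Random` instances built from the same seed select the same case, which is what makes the golden-scenario fixtures below stable.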
## Difficulty Scaling — `apply_difficulty(draft, difficulty, rng)`
| Parameter | Easy | Medium | Hard |
|-----------|------|--------|------|
| `budget_total` | ×1.15 | ×0.95 | ×0.80 |
| `time_limit_days` | unchanged | −1 day | −1 day |
| `staff_count` | unchanged | unchanged | −1 person |
| Resources tightened | 0 | 1 | 2 |
| Conflict constraint | no | yes (1) | yes (1) |
**`_tighten_one_resource`**: picks a random resource, sets `available=False`.
**`_append_conflict_constraint`**: adds a soft constraint noting resource conflict.
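The scaling table translates directly into code. A sketch that assumes `budget_total`, `time_limit_days`, `staff_count`, `resources`, and `constraints` sit at the top level of the draft dict; the `SCALING` table and the conflict-constraint wording are illustrative:

```python
import random
from typing import Any

# Encodes the difficulty table above (assumed layout; values from the doc).
SCALING = {
    "easy":   {"budget": 1.15, "days": 0, "staff": 0, "tighten": 0, "conflict": False},
    "medium": {"budget": 0.95, "days": 1, "staff": 0, "tighten": 1, "conflict": True},
    "hard":   {"budget": 0.80, "days": 1, "staff": 1, "tighten": 2, "conflict": True},
}

def apply_difficulty(draft: dict[str, Any], difficulty: str,
                     rng: random.Random) -> dict[str, Any]:
    s = SCALING[difficulty]
    draft["budget_total"] = round(draft["budget_total"] * s["budget"], 2)
    draft["time_limit_days"] -= s["days"]
    draft["staff_count"] -= s["staff"]
    for _ in range(s["tighten"]):          # _tighten_one_resource (sketch)
        rng.choice(draft["resources"])["available"] = False
    if s["conflict"]:                      # _append_conflict_constraint (sketch)
        draft["constraints"].append({"key": "conflict", "hard": False,
                                     "details": "A resource conflict limits scheduling."})
    return draft
```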
## Utility — `replicalab/utils/seed.py`
| Function | Purpose |
|----------|---------|
| `get_deterministic_seed(seed, namespace)` | SHA256-based child seed derivation |
| `seed_rng(seed, namespace)` | Returns `random.Random(derived_seed)` |
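The SHA256-based child-seed derivation can be sketched as follows. Only the function names and the SHA256 approach come from the doc; the exact hash input format and the default namespace are assumptions:

```python
import hashlib
import random

def get_deterministic_seed(seed: int, namespace: str = "") -> int:
    # Derive a child seed by hashing the parent seed together with a namespace,
    # so different subsystems get independent but reproducible RNG streams.
    digest = hashlib.sha256(f"{seed}:{namespace}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def seed_rng(seed: int, namespace: str = "") -> random.Random:
    return random.Random(get_deterministic_seed(seed, namespace))
```

The same `(seed, namespace)` pair always produces the same stream, while different namespaces diverge, which is what keeps, say, template selection and difficulty scaling from consuming each other's randomness.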
## Type Aliases
```python
Difficulty = Literal["easy", "medium", "hard"]
TemplateName = Literal["math_reasoning", "ml_benchmark", "finance_trading"]
TemplateBuilder = Callable[[Any], dict[str, Any]]
```
## Constants
```python
GOLDEN_SCENARIO_SPECS_PATH = Path("tests/fixtures/golden_scenarios.json")
```
## Who Consumes This
- **`validation.py`** — reads constraints, resources, substitutions, hidden_reference_spec
- **`lab_manager_policy.py`** — reads lab_manager_observation, substitutions, constraints
- **`scientist_policy.py`** — reads scenario pack for system prompt generation
- **`server/app.py`** — calls `generate_scenario()` on reset, stores pack for lab manager
- **`scoring/`** (future) — will read hidden_reference_spec for fidelity scoring