# Scenarios Map — `replicalab/scenarios/`

> Normalized scenario generation across 3 domains with seeded determinism.
>
> **Tasks implemented:** SCN 01-12

## Entry Point

### `generate_scenario(seed, template, difficulty) -> NormalizedScenarioPack`
Located in `templates.py`. The main public API.

**Flow:**
1. `seed_rng(seed)` → deterministic `random.Random` instance
2. `load_template(template)` → picks the template builder function
3. `builder(rng)` → raw draft dict (randomly selects one of two cases per domain)
4. `apply_difficulty(draft, difficulty, rng)` → scales budget, time, staff, resources
5. `_build_pack(seed, template, draft)` → constructs `NormalizedScenarioPack`
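
The five steps above can be condensed into a runnable sketch. The builder here is a simplified stand-in (the real builders live in the template modules), and only the budget scaling from the difficulty table is shown; the `scenario_id` format `"{template}_{seed}"` comes from the source:

```python
import random
from typing import Any

def seed_rng(seed: int) -> random.Random:
    # Step 1: deterministic RNG derived from the integer seed.
    return random.Random(seed)

def stub_builder(rng: random.Random) -> dict[str, Any]:
    # Step 3 stand-in: each real builder randomly picks one of two cases.
    case = rng.choice(["case_a", "case_b"])
    return {"domain_id": "example_domain", "case": case, "budget_total": 100.0}

def generate_scenario_sketch(seed: int, template: str, difficulty: str) -> dict[str, Any]:
    rng = seed_rng(seed)            # 1. seed_rng
    builder = stub_builder          # 2. load_template (stand-in)
    draft = builder(rng)            # 3. builder(rng)
    # 4. apply_difficulty (budget scaling only, per the difficulty table)
    scale = {"easy": 1.15, "medium": 0.95, "hard": 0.80}[difficulty]
    draft["budget_total"] *= scale
    # 5. _build_pack: scenario_id is "{template}_{seed}"
    return {"scenario_id": f"{template}_{seed}", "difficulty": difficulty, **draft}

pack = generate_scenario_sketch(7, "ml_benchmark", "hard")
assert pack == generate_scenario_sketch(7, "ml_benchmark", "hard")  # same seed, same pack
```

The same seed always yields the same pack, which is what makes golden-scenario fixtures viable.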

### `available_scenario_families() -> list[dict]`
Returns `[{"family": name, "difficulties": ["easy", "medium", "hard"]}]` for each template.
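
Given the three template names listed under Type Aliases, the return shape can be reproduced with a short sketch (illustrative, not the actual implementation):

```python
TEMPLATES = ["math_reasoning", "ml_benchmark", "finance_trading"]
DIFFICULTIES = ["easy", "medium", "hard"]

def available_scenario_families() -> list[dict]:
    # One entry per template, each advertising all three difficulty levels.
    return [{"family": name, "difficulties": list(DIFFICULTIES)} for name in TEMPLATES]

families = available_scenario_families()
assert len(families) == 3
assert families[1] == {"family": "ml_benchmark", "difficulties": ["easy", "medium", "hard"]}
```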

## Core Data Classes (all in `templates.py`)

### `NormalizedScenarioPack(BaseModel)` — `extra="forbid"`
The complete scenario definition. Every downstream consumer uses this.

| Field | Type | Source |
|-------|------|--------|
| `scenario_id` | `str` | `"{template}_{seed}"` |
| `template` | `TemplateName` | input param |
| `domain_id` | `str` | from template case |
| `difficulty` | `Difficulty` | input param |
| `seed` | `int` | input param |
| `task_summary` | `str` | from template case |
| `success_criteria` | `list[str]` | from template case |
| `constraints` | `list[ScenarioConstraint]` | from template + difficulty scaling |
| `resources` | `list[ScenarioResource]` | from template + difficulty scaling |
| `allowed_substitutions` | `list[AllowedSubstitution]` | from template case |
| `hidden_reference_spec` | `HiddenReferenceSpec` | from template case |
| `scientist_observation` | `ScientistObservation` | built from case fields |
| `lab_manager_observation` | `LabManagerObservation` | built from case fields |

### `ScenarioConstraint(BaseModel)`
| Field | Type | Example |
|-------|------|---------|
| `key` | `str` | `"gpu_hours"` |
| `label` | `str` | `"Maximum GPU budget"` |
| `quantity` | `float \| int \| None` | `8` |
| `unit` | `str \| None` | `"gpu_hours"` |
| `comparator` | `Literal["<=", ">=", "="]` | `"<="` |
| `hard` | `bool` | `True` |
| `details` | `str` | `"The full run must fit within eight GPU-hours."` |
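
A constraint of this shape is straightforward to evaluate against an observed value. The helper below is a hypothetical illustration of the comparator semantics (it is not part of the module, and it works on plain dicts rather than the pydantic model):

```python
import operator

# Map the three comparator literals to their standard-library operators.
_COMPARATORS = {"<=": operator.le, ">=": operator.ge, "=": operator.eq}

def constraint_satisfied(constraint: dict, observed: float) -> bool:
    # True when the observed quantity respects the comparator,
    # e.g. observed gpu_hours <= 8 for the example row above.
    cmp = _COMPARATORS[constraint["comparator"]]
    return cmp(observed, constraint["quantity"])

gpu_budget = {"key": "gpu_hours", "quantity": 8, "comparator": "<=", "hard": True}
assert constraint_satisfied(gpu_budget, 6.5)      # within budget
assert not constraint_satisfied(gpu_budget, 9.0)  # over budget
```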

### `ScenarioResource(BaseModel)`
| Field | Type | Example |
|-------|------|---------|
| `key` | `str` | `"gpu_node"` |
| `label` | `str` | `"A100 GPU node"` |
| `quantity` | `float \| int \| None` | `1` |
| `unit` | `str \| None` | `"node"` |
| `available` | `bool` | `True` |
| `category` | `str` | `"compute"` |
| `details` | `str` | `"Reserved for one benchmark run at a time."` |

### `AllowedSubstitution(BaseModel)`
| Field | Type | Example |
|-------|------|---------|
| `original` | `str` | `"A100 GPU node"` |
| `alternative` | `str` | `"V100 GPU node"` |
| `condition` | `str` | `"Use if A100 is booked."` |
| `tradeoff` | `str` | `"V100 is slower; extend training by ~30%."` |

### `HiddenReferenceSpec(BaseModel)`
Ground truth the judge uses to score fidelity. The scientist never sees this.

| Field | Type | Example |
|-------|------|---------|
| `summary` | `str` | `"A valid plan keeps the published split..."` |
| `required_elements` | `list[str]` | `["published data split", "held-out accuracy evaluation"]` |
| `flexible_elements` | `list[str]` | `["batch size", "learning-rate schedule"]` |
| `target_metric` | `str` | `"held_out_accuracy"` |
| `target_value` | `str` | `"within one point of the reported baseline"` |
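
One plausible way a judge could use this spec is a coverage check over `required_elements`. This is an illustrative sketch only; the actual scoring logic belongs to the future `scoring/` package:

```python
def required_coverage(spec: dict, plan_text: str) -> float:
    # Fraction of required_elements mentioned in the plan.
    # Real scoring would be fuzzier; exact substring match keeps the sketch simple.
    required = spec["required_elements"]
    hits = sum(1 for element in required if element.lower() in plan_text.lower())
    return hits / len(required)

spec = {"required_elements": ["published data split", "held-out accuracy evaluation"]}
plan = "We reuse the published data split and report a held-out accuracy evaluation."
assert required_coverage(spec, plan) == 1.0
```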

## Template Builders

Each returns a raw `dict[str, Any]` with one randomly selected case.

### `build_math_reasoning_template(rng)` — `math_reasoning.py`
- **Domain:** `mathematics`
- **Case A:** Cauchy-Schwarz inequality — structured proof verification
- **Case B:** Jensen's inequality — convexity-based proof
- **Equipment:** Structured proof notebook, Automated proof checker
- **Reagents:** Graduate reviewer, Reference textbook
- **Substitutions:** Graduate reviewer → self-check rubric

### `build_ml_benchmark_template(rng)` — `ml_benchmark.py`
- **Domain:** `machine_learning`
- **Case A:** AG News TinyBERT — text classification replication
- **Case B:** CIFAR-10 ResNet-18 — image classification replication
- **Equipment:** A100 GPU node, Dataset mirror, Experiment tracker
- **Reagents:** Pre-trained checkpoint, Evaluation harness
- **Substitutions:** A100 → V100 (slower), full dataset → stratified sample

### `build_finance_trading_template(rng)` — `finance_trading.py`
- **Domain:** `finance_trading`
- **Case A:** SPY/QQQ mean-reversion — pairs trading backtest
- **Case B:** Momentum futures — trend-following strategy
- **Equipment:** Backtest engine, Historical daily bar dataset
- **Reagents:** Risk reviewer, Compliance packet
- **Substitutions:** Daily bars → weekly bars, risk reviewer → automated risk check
- **Safety restrictions:** offline-only execution policy

## Difficulty Scaling — `apply_difficulty(draft, difficulty, rng)`

| Parameter | Easy | Medium | Hard |
|-----------|------|--------|------|
| `budget_total` | ×1.15 | ×0.95 | ×0.80 |
| `time_limit_days` | unchanged | −1 day | −1 day |
| `staff_count` | unchanged | unchanged | −1 person |
| Resources tightened | 0 | 1 | 2 |
| Conflict constraint | no | yes (1) | yes (1) |

**`_tighten_one_resource`**: picks a random resource, sets `available=False`.
**`_append_conflict_constraint`**: adds a soft constraint noting resource conflict.
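
Assuming the draft carries `budget_total`, `time_limit_days`, and `staff_count` keys, the table translates into roughly the following (a sketch of the numeric scaling only, omitting resource tightening and the conflict constraint):

```python
import random

# Per-difficulty adjustments, taken directly from the table above.
_BUDGET_SCALE = {"easy": 1.15, "medium": 0.95, "hard": 0.80}
_DAY_DELTA = {"easy": 0, "medium": -1, "hard": -1}
_STAFF_DELTA = {"easy": 0, "medium": 0, "hard": -1}

def apply_difficulty_sketch(draft: dict, difficulty: str, rng: random.Random) -> dict:
    scaled = dict(draft)  # leave the input draft untouched
    scaled["budget_total"] = round(draft["budget_total"] * _BUDGET_SCALE[difficulty], 2)
    scaled["time_limit_days"] = draft["time_limit_days"] + _DAY_DELTA[difficulty]
    scaled["staff_count"] = draft["staff_count"] + _STAFF_DELTA[difficulty]
    return scaled

draft = {"budget_total": 1000.0, "time_limit_days": 5, "staff_count": 3}
hard = apply_difficulty_sketch(draft, "hard", random.Random(0))
assert hard == {"budget_total": 800.0, "time_limit_days": 4, "staff_count": 2}
```

The `rng` parameter is unused here; in the real function it drives `_tighten_one_resource`'s random pick.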

## Utility — `replicalab/utils/seed.py`

| Function | Purpose |
|----------|---------|
| `get_deterministic_seed(seed, namespace)` | SHA256-based child seed derivation |
| `seed_rng(seed, namespace)` | Returns `random.Random(derived_seed)` |
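
The SHA256-based derivation presumably works along these lines; this is a hedged sketch, and the exact hashing scheme in `seed.py` may differ:

```python
import hashlib
import random

def get_deterministic_seed(seed: int, namespace: str = "") -> int:
    # Hash seed + namespace so different namespaces yield independent child seeds.
    digest = hashlib.sha256(f"{seed}:{namespace}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def seed_rng(seed: int, namespace: str = "") -> random.Random:
    return random.Random(get_deterministic_seed(seed, namespace))

# Same inputs give the same stream; different namespaces diverge.
assert seed_rng(42, "a").random() == seed_rng(42, "a").random()
assert get_deterministic_seed(42, "a") != get_deterministic_seed(42, "b")
```

Namespacing matters because several components (scenario draft, difficulty scaling) may each need their own reproducible stream from one top-level seed.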

## Type Aliases

```python
Difficulty = Literal["easy", "medium", "hard"]
TemplateName = Literal["math_reasoning", "ml_benchmark", "finance_trading"]
TemplateBuilder = Callable[[Any], dict[str, Any]]
```

## Constants

```python
GOLDEN_SCENARIO_SPECS_PATH = Path("tests/fixtures/golden_scenarios.json")
```

## Who Consumes This

- **`validation.py`** β€” reads constraints, resources, substitutions, hidden_reference_spec
- **`lab_manager_policy.py`** β€” reads lab_manager_observation, substitutions, constraints
- **`scientist_policy.py`** β€” reads scenario pack for system prompt generation
- **`server/app.py`** β€” calls `generate_scenario()` on reset, stores pack for lab manager
- **`scoring/`** (future) β€” will read hidden_reference_spec for fidelity scoring