ayushozha committed on
Commit
ec2e890
·
1 Parent(s): fe921c8

Add hybrid Oracle layer and update architecture docs

README.md CHANGED
@@ -1,6 +1,6 @@
1
  ---
2
  title: ReplicaLab
3
- emoji: 🧪
4
  colorFrom: blue
5
  colorTo: green
6
  sdk: docker
@@ -14,37 +14,53 @@ pinned: false
14
 
15
  > *How do we adapt a plan without breaking the objective?*
16
 
17
- ReplicaLab trains an agent to negotiate high-quality plans under real constraints. The initial domain focus is mathematics and machine learning, with finance and trading design in offline or backtest form as the third scenario family. Physics and biology remain later adapters once the core normalized scenario layer is stable.
18
 
19
  ## Current Build Status
20
 
21
- - The repository is now past the foundation stage and has a working real environment plus deterministic judge pipeline.
22
- - The Python package foundation is verified through editable install plus full test-suite checks.
23
- - Shared contracts currently live in `replicalab/models.py`, with the signed-off freeze in `docs/fnd08_frozen_json_contract.md`.
24
- - `server/app.py` now serves the real `ReplicaLabEnv` by default, with the legacy stub retained only as a fallback safety path.
25
  - `openenv.yaml` exists and passes local OpenEnv validation.
26
  - Local Docker validation has been completed for the server image on port `7860`.
27
- - Hugging Face Spaces deployment is live at `https://ayushozha-replicalab.hf.space` with all endpoints verified.
28
- - The frozen outer contract remains stable while the internal scenario engine moves toward a normalized scenario pack.
29
- - The planned Lab Manager path is hybrid: model-backed negotiation language plus deterministic feasibility grounding.
 
30
 
31
  ## Team Ownership
32
 
33
  | Owner | Current focus |
34
  |------|----------------|
35
- | Kian (Person A) | Shared schemas, validation, normalized scenario engine, judge logic |
36
- | Person B (Ayush) | Contract freeze, domain-neutral Scientist prompting and parsing, notebook and client path |
37
- | Max (Person C) | Repo/runtime setup, frontend shell, server and deployment plumbing |
38
- | Kush (Person D) | README and demo docs, UI shell planning, polish and presentation assets |
39
 
40
  ---
41
 
42
  ## Architecture
43
 
44
  <p align="center">
45
- <img src="./architecture.svg" alt="ReplicaLab Architecture" width="100%"/>
46
  </p>
47
48
  ---
49
 
50
  ## How It Works
@@ -55,26 +71,31 @@ Each episode simulates a negotiation between two agents inside a constrained tec
55
  |------|------|----------------|
56
  | **Scientist** | Trainable model policy | Proposes plans, asks questions, and preserves objective quality |
57
  | **Lab Manager** | Hybrid model-backed policy with deterministic grounding | Negotiates revisions while the checker enforces feasibility and constraint truth |
58
- | **Judge** | Deterministic rubric engine | Scores the final plan on Rigor, Feasibility, and Fidelity |
 
59
 
60
  ### Episode Lifecycle
61
 
62
- 1. **Reset** -- `reset(seed)` generates a normalized scenario pack, mapped role observations, and a hidden reference spec
63
- 2. **Scientist observes** -- Task summary, experiment or benchmark goal, conversation history, current plan
64
- 3. **Lab Manager observes** -- Resource, scheduling, staffing, and policy constraints mapped from the same normalized pack
65
- 4. **Negotiation** -- Multiple rounds of proposals, counteroffers, and questions
66
- 5. **Agreement or timeout** -- Both accept, or the round limit is reached
67
- 6. **Reward** -- Judge scores the final plan against the hidden reference spec
 
68
 
69
  ### Reward Formula
70
 
71
- ```
72
- total_reward = 10 * rigor * feasibility * fidelity + efficiency_bonus + communication_bonus - penalties
 
 
 
73
  ```
74
 
75
- The **multiplicative core** prevents fake wins: a theoretically strong but impossible plan scores low, and a cheap but invalid plan also scores low.
76
 
77
- ### Internal normalization rule
78
 
79
  The outer action and observation models stay stable. Domain-specific content is converted into a normalized scenario pack first, then mapped into the current `ScientistObservation` and `LabManagerObservation` contracts. Prompts are assembled from that normalized data rather than hard-coded per domain.
80
 
@@ -82,27 +103,24 @@ The outer action and observation models stay stable. Domain-specific content is
82
 
83
  ## Getting Started
84
 
85
- This section mixes verified foundation commands with planned end-to-end commands. As of 2026-03-08, the verified local path is the editable Python install plus the shared-model import smoke test.
86
 
87
  ### Prerequisites
88
 
89
  - Python 3.10+
90
- - Node.js 18+ (for the frontend)
91
- - Docker (for deployment)
92
- - A Google Colab account (for RL training)
93
 
94
  ### Installation
95
 
96
  ```bash
97
- # Clone the repository
98
  git clone https://github.com/Ayush10/replicalab-ai.git
99
  cd replicalab-ai
100
 
101
- # Create a virtual environment
102
  python -m venv .venv
103
- source .venv/bin/activate # On Windows: .venv\Scripts\activate
104
 
105
- # Install Python dependencies
106
  pip install -e ".[dev]"
107
  ```
108
 
@@ -115,11 +133,10 @@ python -c "from replicalab.models import ScientistAction, LabManagerAction; prin
115
  ### Running the Environment Server
116
 
117
  ```bash
118
- # Planned command once server wiring lands
119
  python -m server.app
120
  ```
121
 
122
- The server is intended to start at `http://localhost:7860` once the server task chain is complete.
123
 
124
  ### Running the Frontend
125
 
@@ -129,8 +146,6 @@ npm install
129
  npm run dev
130
  ```
131
 
132
- The React UI is intended to start at `http://localhost:5173` once the frontend shell and Vite config are in place.
133
-
134
  ### Running Tests
135
 
136
  ```bash
@@ -141,30 +156,36 @@ pytest tests/
141
 
142
  ## Training the Scientist
143
 
144
- RL training improves the Scientist agent's ability to negotiate effective, feasible plans.
145
 
146
- ### Selected base model
147
 
148
  - **Primary Scientist model:** `Qwen3-4B`
149
  - **Stretch fallback:** `Qwen3-8B`
150
  - **Decision record:** `docs/agt11_scientist_model_selection.md`
151
 
152
- ### Quick Start (Google Colab)
153
 
154
- 1. Open `notebooks/train_colab.ipynb` in Google Colab
155
- 2. Connect to a GPU runtime
156
- 3. Run all cells -- the notebook handles environment setup, rollout, and training via TRL/Unsloth with GRPO
 
 
 
 
 
157
 
158
  ### Training Loop
159
 
160
- ```
161
- Environment resets -> Scientist proposes -> Lab Manager responds -> ... -> Episode ends -> Reward computed -> Policy updated
162
  ```
163
 
164
- **Target behaviors over training:**
 
165
  - Ask better questions before committing to a plan
166
  - Preserve critical checks, assumptions, and required steps
167
- - Choose realistic substitutions when a preferred method or resource is unavailable
168
  - Reach agreement in fewer rounds
169
  - Avoid impossible or over-budget plans
170
 
@@ -172,7 +193,7 @@ Environment resets -> Scientist proposes -> Lab Manager responds -> ... -> Episo
172
 
173
  ## Scenario System
174
 
175
- Scenarios are generated deterministically from a seed. Each template first emits a normalized scenario pack with:
176
 
177
  - `task_summary`
178
  - `success_criteria`
@@ -181,7 +202,7 @@ Scenarios are generated deterministically from a seed. Each template first emits
181
  - `allowed_substitutions`
182
  - `hidden_reference_spec`
183
 
184
- Difficulty scaling should then mechanically tighten constraints, remove resources, or add conflicts instead of changing the outer contract or prompt structure.
185
 
186
  | Difficulty | Description |
187
  |------------|-------------|
@@ -192,7 +213,7 @@ Difficulty scaling should then mechanically tighten constraints, remove resource
192
  ### Included Scenario Templates
193
 
194
  | Template | Domain | Example Task |
195
- |----------|--------|--------------------|
196
  | `math_reasoning` | Mathematics | Proof planning under tool, review, and time constraints |
197
  | `ml_benchmark` | Machine learning | Model evaluation with dataset, compute, and time constraints |
198
  | `finance_trading` | Finance and trading | Offline strategy and backtest planning under risk and capital limits |
@@ -201,57 +222,68 @@ Difficulty scaling should then mechanically tighten constraints, remove resource
201
 
202
  ## Project Structure
203
 
204
- ```
205
  replicalab-ai/
206
  ├── README.md
207
- ├── architecture.svg
208
  ├── pyproject.toml
209
  ├── openenv.yaml
210
  ├── replicalab/
211
  │ ├── __init__.py
212
- │ ├── models.py # Action, Observation, State schemas
213
- │ ├── client.py # OpenEnv client wrapper
 
 
 
214
  │ ├── prompts/
215
- │ │ ├── scientist.txt # Scientist system prompt
216
- │ │ ├── lab_manager.txt # Lab Manager response templates
217
- │ │ └── judge.txt # Judge rubric prompt
 
 
 
 
 
218
  │ ├── scenarios/
219
- │ │ ├── templates.py # Normalized scenario template layer
220
- │ │ ├── math_reasoning.py # Mathematics scenarios
221
- │ │ ├── ml_benchmark.py # ML benchmark scenarios
222
  │ │ └── finance_trading.py
223
  │ ├── scoring/
224
- │ │ ├── rubric.py # Main scoring engine
225
- │ │ ├── rigor.py # Objective-validity scorer
226
- │ │ ├── feasibility.py # Constraint feasibility scorer
227
- │ │ └── fidelity.py # Hidden-reference fidelity scorer
 
228
  │ ├── agents/
229
  │ │ ├── scientist_policy.py
230
  │ │ ├── lab_manager_policy.py
 
231
  │ │ └── judge_policy.py
232
  │ ├── env/
233
- │ │ └── replicalab_env.py # OpenEnv environment implementation
 
 
234
  │ └── utils/
235
  │ ├── seed.py
236
  │ ├── validation.py
237
  │ └── logging.py
238
  ├── server/
239
- │ ├── app.py # FastAPI + WebSocket server
240
  │ ├── requirements.txt
241
  │ └── Dockerfile
242
  ├── frontend/
243
  │ ├── package.json
244
  │ ├── vite.config.ts
245
  │ └── src/
246
- │ ├── App.tsx
247
- │ ├── components/
248
- │ └── pages/
249
  ├── notebooks/
250
- │ └── train_colab.ipynb # RL training notebook
251
  └── tests/
252
  ├── test_env.py
253
  ├── test_reward.py
254
  ├── test_scenarios.py
 
 
255
  └── test_server.py
256
  ```
257
 
@@ -268,16 +300,24 @@ docker run -p 7860:7860 replicalab
268
 
269
  ### Hugging Face Spaces
270
 
271
- **Live deployment:** https://ayushozha-replicalab.hf.space
272
 
273
  The app is deployed on HF Spaces with `sdk: docker` on port `7860`.
274
 
275
  ```bash
276
- # Verify the live Space
277
  curl https://ayushozha-replicalab.hf.space/health
278
- # {"status":"ok","env":"real"}
279
  ```
280
281
  ---
282
 
283
  ## Toolchain
@@ -291,7 +331,7 @@ curl https://ayushozha-replicalab.hf.space/health
291
  | **Tailwind + shadcn/ui** | Styling |
292
  | **Docker** | Packaging |
293
  | **Hugging Face Spaces** | Public hosting |
294
- | **Google Colab** | Training notebook |
295
 
296
  ---
297
 
 
1
  ---
2
  title: ReplicaLab
3
+ emoji: "🧪"
4
  colorFrom: blue
5
  colorTo: green
6
  sdk: docker
 
14
 
15
  > *How do we adapt a plan without breaking the objective?*
16
 
17
+ ReplicaLab trains a Scientist policy to negotiate better plans under real constraints. The initial domain focus is mathematics and machine learning, with offline finance and trading design as the third scenario family. Physics and biology remain future adapters after the core normalized scenario layer is stable.
18
 
19
  ## Current Build Status
20
 
21
+ - The repository is past the foundation stage and has a working real environment plus deterministic judge pipeline.
22
+ - The Python package foundation is verified through editable install plus the full test suite.
23
+ - Shared contracts live in `replicalab/models.py`, with the signed-off freeze in `docs/fnd08_frozen_json_contract.md`.
24
+ - `server/app.py` serves the real `ReplicaLabEnv` by default, with the legacy stub retained only as a fallback path.
25
  - `openenv.yaml` exists and passes local OpenEnv validation.
26
  - Local Docker validation has been completed for the server image on port `7860`.
27
+ - Hugging Face Spaces deployment is live at `https://ayushozha-replicalab.hf.space` for the deterministic environment path.
28
+ - The frozen outer contract remains stable while the internal scenario engine uses a normalized scenario pack.
29
+ - The Lab Manager path is hybrid: deterministic feasibility truth with optional model-backed narrative responses.
30
+ - An additive Oracle hybrid layer now provides optional frontier-model world generation, event injection, Lab Manager narration, and post-mortem analysis, while deterministic scoring remains the canonical RL reward path.
31
 
32
  ## Team Ownership
33
 
34
  | Owner | Current focus |
35
  |------|----------------|
36
+ | Kian (Person A) | Shared schemas, validation, scenario engine, judge logic |
37
+ | Person B (Ayush) | Scientist prompting and parsing, notebook and client path |
38
+ | Max (Person C) | Server, deployment, and runtime plumbing |
39
+ | Kush (Person D) | Frontend, UI polish, docs, and demo assets |
40
 
41
  ---
42
 
43
  ## Architecture
44
 
45
  <p align="center">
46
+ <img src="./ReplicaLab_Architecture_v2.svg" alt="ReplicaLab Hybrid Architecture" width="100%"/>
47
  </p>
48
 
49
+ ReplicaLab uses a **hybrid Oracle architecture**:
50
+
51
+ - The **Oracle layer** is optional and powers world-building and narrative intelligence:
52
+ - richer scenario generation
53
+ - optional event injection
54
+ - optional LLM Lab Manager narration
55
+ - optional post-mortem analysis
56
+ - The **deterministic core** remains canonical for RL:
57
+ - environment transitions
58
+ - validation
59
+ - grounded Lab Manager feasibility
60
+ - judge scoring and reward math
61
+
62
+ This satisfies the sponsor-facing “LLM as environment intelligence” direction without making reward noisy or irreproducible.
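The split above can be sketched as a feature flag: when no Oracle is attached, the fully deterministic path runs unchanged. This is a minimal illustration of the pattern; the class and method names below are hypothetical, not the actual `ReplicaLabEnv` API.

```python
class EnvWithOracle:
    """Sketch: optional Oracle narration layered over a deterministic core."""

    def __init__(self, oracle=None):
        self.oracle = oracle  # None keeps the fully deterministic path

    def narrate(self, deterministic_reply: str) -> str:
        # Feasibility truth is computed deterministically before this call;
        # the Oracle may only rephrase it, never override it.
        if self.oracle is None:
            return deterministic_reply
        return self.oracle.narrate(deterministic_reply)
```

Because the Oracle only wraps an already-decided reply, disabling it cannot change transitions, validation, or reward.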
63
+
64
  ---
65
 
66
  ## How It Works
 
71
  |------|------|----------------|
72
  | **Scientist** | Trainable model policy | Proposes plans, asks questions, and preserves objective quality |
73
  | **Lab Manager** | Hybrid model-backed policy with deterministic grounding | Negotiates revisions while the checker enforces feasibility and constraint truth |
74
+ | **Judge** | Deterministic rubric engine | Scores the final plan on rigor, feasibility, fidelity, and parsimony |
75
+ | **Oracle (optional)** | Frontier-model intelligence layer | Generates richer worlds, optional events, optional live LM narration, and post-mortem analysis |
76
 
77
  ### Episode Lifecycle
78
 
79
+ 1. **Reset**: `reset(seed)` builds a normalized scenario pack and hidden reference spec.
80
+ 2. **Scientist observes**: task summary, goal, history, and current plan.
81
+ 3. **Lab Manager observes**: resource, scheduling, staffing, and policy constraints from the same normalized pack.
82
+ 4. **Negotiation**: multiple rounds of proposals, counteroffers, and questions.
83
+ 5. **Agreement or timeout**: both accept, or the round limit is reached.
84
+ 6. **Reward**: the deterministic judge scores the final plan.
85
+ 7. **Optional Oracle overlays**: event injection, round commentary, and post-mortem may be layered on top without replacing deterministic reward.
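The lifecycle above reduces to a simple loop shape. This is a hedged sketch only: the real client and environment APIs live in `replicalab/client.py` and `replicalab/env/replicalab_env.py`, and the attribute names used here (`scientist_view`, `lab_manager_view`) are illustrative.

```python
def run_episode(env, scientist, lab_manager, max_rounds=8):
    """Sketch of one negotiation episode against the environment."""
    obs = env.reset(seed=42)          # normalized pack + hidden reference spec
    total = 0.0
    for _ in range(max_rounds):
        action = scientist.act(obs.scientist_view)
        reply = lab_manager.respond(obs.lab_manager_view, action)
        obs, reward, done = env.step(action, reply)
        total += reward               # per-step shaping accumulates here
        if done:                      # agreement reached or round limit hit
            break
    return total
```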
86
 
87
  ### Reward Formula
88
 
89
+ ```text
90
+ total_reward = 10 * rigor * feasibility * fidelity * parsimony
91
+ + efficiency_bonus
92
+ + communication_bonus
93
+ - penalties
94
  ```
95
 
96
+ The multiplicative core prevents fake wins: a theoretically strong but impossible plan scores low, and a cheap but invalid plan also scores low. Even when the Oracle layer is enabled, this deterministic path remains canonical for RL training and before/after evaluation.
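As a minimal sketch of that math (the canonical implementation lives in `replicalab/scoring/rubric.py`; the function signature here is illustrative, with all component scores assumed to be in `[0, 1]`):

```python
def terminal_reward(rigor, feasibility, fidelity, parsimony,
                    efficiency_bonus=0.0, communication_bonus=0.0,
                    penalties=0.0):
    """Multiplicative core plus additive bonuses and named penalties."""
    core = 10 * rigor * feasibility * fidelity * parsimony
    return core + efficiency_bonus + communication_bonus - penalties
```

Because the core is a product, any component at zero (for example, an infeasible plan) zeroes the entire core regardless of how strong the other components are.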
97
 
98
+ ### Internal Normalization Rule
99
 
100
  The outer action and observation models stay stable. Domain-specific content is converted into a normalized scenario pack first, then mapped into the current `ScientistObservation` and `LabManagerObservation` contracts. Prompts are assembled from that normalized data rather than hard-coded per domain.
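A minimal sketch of that mapping, using only the pack fields documented in this README; the real contracts are the models in `replicalab/models.py`, and the plain-dict shapes here are assumptions for illustration.

```python
def to_scientist_observation(pack: dict) -> dict:
    """Map the normalized pack into the Scientist's view (sketch)."""
    # The Scientist never sees the hidden reference spec.
    return {
        "task_summary": pack["task_summary"],
        "success_criteria": pack["success_criteria"],
    }

def to_lab_manager_observation(pack: dict) -> dict:
    """Map the same pack into the Lab Manager's constraint view (sketch)."""
    return {
        "allowed_substitutions": pack["allowed_substitutions"],
    }
```

Both views derive from one pack, so a new domain only needs a new template, not new observation contracts.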
101
 
 
103
 
104
  ## Getting Started
105
 
106
+ This section mixes verified foundation commands with planned end-to-end commands.
107
 
108
  ### Prerequisites
109
 
110
  - Python 3.10+
111
+ - Node.js 18+
112
+ - Docker
113
+ - A notebook runtime such as Google Colab or the H100-backed Jupyter environment
114
 
115
  ### Installation
116
 
117
  ```bash
 
118
  git clone https://github.com/Ayush10/replicalab-ai.git
119
  cd replicalab-ai
120
 
 
121
  python -m venv .venv
122
+ source .venv/bin/activate # Windows: .venv\Scripts\activate
123
 
 
124
  pip install -e ".[dev]"
125
  ```
126
 
 
133
  ### Running the Environment Server
134
 
135
  ```bash
 
136
  python -m server.app
137
  ```
138
 
139
+ The server is intended to start at `http://localhost:7860`.
140
 
141
  ### Running the Frontend
142
 
 
146
  npm run dev
147
  ```
148
 
 
 
149
  ### Running Tests
150
 
151
  ```bash
 
156
 
157
  ## Training the Scientist
158
 
159
+ RL training improves the Scientist agent's ability to negotiate effective, feasible plans.
160
 
161
+ ### Selected Base Model
162
 
163
  - **Primary Scientist model:** `Qwen3-4B`
164
  - **Stretch fallback:** `Qwen3-8B`
165
  - **Decision record:** `docs/agt11_scientist_model_selection.md`
166
 
167
+ ### Planned Training Path
168
 
169
+ 1. Connect the notebook to the environment via `replicalab/client.py`
170
+ 2. Collect rollouts with `replicalab/training/rollout.py`
171
+ 3. Train with **Unsloth or HF TRL**
172
+ 4. Save:
173
+ - reward curves
174
+ - component curves
175
+ - before/after evaluation metrics
176
+ - replay and plot artifacts
177
 
178
  ### Training Loop
179
 
180
+ ```text
181
+ reset -> Scientist acts -> Lab Manager responds -> ... -> episode ends -> deterministic reward -> policy update
182
  ```
183
 
184
+ ### Target Behaviors Over Training
185
+
186
  - Ask better questions before committing to a plan
187
  - Preserve critical checks, assumptions, and required steps
188
+ - Choose realistic substitutions when preferred resources are unavailable
189
  - Reach agreement in fewer rounds
190
  - Avoid impossible or over-budget plans
191
 
 
193
 
194
  ## Scenario System
195
 
196
+ Scenarios are generated deterministically from a seed. Each template emits a normalized scenario pack with:
197
 
198
  - `task_summary`
199
  - `success_criteria`
 
202
  - `allowed_substitutions`
203
  - `hidden_reference_spec`
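Seed determinism can be sketched as below; the real templates live in `replicalab/scenarios/templates.py` and emit the full pack, so the field values here are toy placeholders.

```python
import random

def generate_pack(seed: int) -> dict:
    """Sketch: the same seed always yields the same scenario pack."""
    rng = random.Random(seed)        # seeded RNG, no global state
    budget = rng.randint(10, 100)
    return {
        "task_summary": f"Plan within a budget of {budget} units",
        "success_criteria": ["stay within budget"],
        "allowed_substitutions": [],
        "hidden_reference_spec": {"budget": budget},
    }
```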
204
 
205
+ Difficulty scaling should mechanically tighten constraints, remove resources, or add conflicts instead of changing the outer contract or prompt structure.
206
 
207
  | Difficulty | Description |
208
  |------------|-------------|
 
213
  ### Included Scenario Templates
214
 
215
  | Template | Domain | Example Task |
216
+ |----------|--------|--------------|
217
  | `math_reasoning` | Mathematics | Proof planning under tool, review, and time constraints |
218
  | `ml_benchmark` | Machine learning | Model evaluation with dataset, compute, and time constraints |
219
  | `finance_trading` | Finance and trading | Offline strategy and backtest planning under risk and capital limits |
 
222
 
223
  ## Project Structure
224
 
225
+ ```text
226
  replicalab-ai/
227
  ├── README.md
228
+ ├── ReplicaLab_Architecture_v2.svg
229
  ├── pyproject.toml
230
  ├── openenv.yaml
231
  ├── replicalab/
232
  │ ├── __init__.py
233
+ │ ├── models.py # Action, Observation, State schemas
234
+ │ ├── client.py # OpenEnv client wrapper
235
+ │ ├── oracle.py # Optional frontier-model Oracle wrapper
236
+ │ ├── oracle_models.py # Oracle scenario and post-mortem schemas
237
+ │ ├── cache.py # Cached Oracle scenario generation
238
  │ ├── prompts/
239
+ │ │ ├── scientist.txt
240
+ │ │ ├── lab_manager.txt
241
+ │ │ ├── judge.txt
242
+ │ │ ├── oracle_world_architect.txt
243
+ │ │ ├── oracle_adjudicator.txt
244
+ │ │ ├── oracle_event_injector.txt
245
+ │ │ ├── oracle_post_mortem.txt
246
+ │ │ └── oracle_lab_manager.txt
247
  │ ├── scenarios/
248
+ │ │ ├── templates.py # Normalized scenario pack + Oracle adapter
249
+ │ │ ├── math_reasoning.py
250
+ │ │ ├── ml_benchmark.py
251
  │ │ └── finance_trading.py
252
  │ ├── scoring/
253
+ │ │ ├── rubric.py # Canonical deterministic reward math
254
+ │ │ ├── rigor.py
255
+ │ │ ├── feasibility.py
256
+ │ │ ├── fidelity.py
257
+ │ │ └── explain.py
258
  │ ├── agents/
259
  │ │ ├── scientist_policy.py
260
  │ │ ├── lab_manager_policy.py
261
+ │ │ ├── lab_manager_agent.py # Optional LLM Lab Manager wrapper
262
  │ │ └── judge_policy.py
263
  │ ├── env/
264
+ │ │ └── replicalab_env.py # Real env with optional Oracle hooks
265
+ │ ├── training/
266
+ │ │ └── rollout.py
267
  │ └── utils/
268
  │ ├── seed.py
269
  │ ├── validation.py
270
  │ └── logging.py
271
  ├── server/
272
+ │ ├── app.py
273
  │ ├── requirements.txt
274
  │ └── Dockerfile
275
  ├── frontend/
276
  │ ├── package.json
277
  │ ├── vite.config.ts
278
  │ └── src/
 
 
 
279
  ├── notebooks/
280
+ │ └── train_colab.ipynb
281
  └── tests/
282
  ├── test_env.py
283
  ├── test_reward.py
284
  ├── test_scenarios.py
285
+ ├── test_oracle.py
286
+ ├── test_cache.py
287
  └── test_server.py
288
  ```
289
 
 
300
 
301
  ### Hugging Face Spaces
302
 
303
+ **Live deployment:** `https://ayushozha-replicalab.hf.space`
304
 
305
  The app is deployed on HF Spaces with `sdk: docker` on port `7860`.
306
 
307
  ```bash
 
308
  curl https://ayushozha-replicalab.hf.space/health
309
+ # -> {"status":"ok","env":"real"}
310
  ```
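The same check can be scripted in Python. This is a small convenience sketch, not part of the repo; it assumes the `/health` endpoint returns the JSON body shown above.

```python
import json
from urllib.request import urlopen

def parse_health(raw: bytes) -> bool:
    """True when the payload reports a healthy real environment."""
    payload = json.loads(raw)
    return payload.get("status") == "ok" and payload.get("env") == "real"

def check_health(base_url: str) -> bool:
    # e.g. check_health("https://ayushozha-replicalab.hf.space")
    with urlopen(f"{base_url}/health", timeout=10) as resp:
        return parse_health(resp.read())
```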
311
 
312
+ The current Space deployment is complete for the deterministic environment path. If live Oracle mode is enabled later, the Space will additionally need:
313
+
314
+ - provider SDK dependencies
315
+ - model API-key secrets
316
+ - runtime feature flags
317
+ - cold-start and latency handling
318
+
319
+ The deterministic deployment itself does not need to be redesigned.
320
+
321
  ---
322
 
323
  ## Toolchain
 
331
  | **Tailwind + shadcn/ui** | Styling |
332
  | **Docker** | Packaging |
333
  | **Hugging Face Spaces** | Public hosting |
334
+ | **Notebook / Colab / H100** | Training and evaluation |
335
 
336
  ---
337
 
ReplicaLab_Architecture_v2.svg ADDED

Git LFS Details

  • SHA256: 604cf11d01f75dba003c0a0e35ea4ea01c74f4f56a396b5b42c3d7a9d6474f8e
  • Pointer size: 130 Bytes
  • Size of remote file: 12 kB
ReplicaLab_Architecture_v2_polished.svg ADDED

Git LFS Details

  • SHA256: f48980a10d69f4472bba8e0a05606108d944fb5b06bbe7fdbf755bac495c7576
  • Pointer size: 130 Bytes
  • Size of remote file: 25 kB
docs/changes.md CHANGED
@@ -58,4 +58,6 @@ Rules:
58
  | 2026-03-08 | Person B (Ayush) | TRN 04 | Implemented the rollout collection loop as a reusable Python module rather than only inside a notebook | The backlog labels `TRN 04` as notebook work, but implementing it in `replicalab/training/rollout.py` makes the same rollout logic reusable across notebooks, tests, and future trainer code while preserving the required behavior | Extended `RolloutWorker` with terminal `StepInfo`, bounded tool trace aggregation, and `collect_rollouts(...)`; added trace and batch tests in `tests/test_rollout_traces.py` and kept the rollout logic fully testable outside a notebook | `TRN 05` is now unblocked and notebooks can import the rollout loop instead of reimplementing it |
59
  | 2026-03-08 | Person B (Ayush) | API 14 | Completed the REST session isolation verification even though the task was assigned to Person C | The session isolation logic already worked correctly in `server/app.py`; the task was still marked partial because no dedicated tests proved concurrent-user isolation against the real env | Created `tests/test_api_rest_isolation.py` with 11 tests covering session independence, round-count isolation, terminal isolation, session_id reuse, invalid session handling, and replay isolation; no server changes needed; 307 tests pass | No new dependencies unblocked; `API 14` was the last partial API task besides `API 01` and `OBS 02` |
60
  | 2026-03-08 | Person B (Ayush) | MOD 07 and MOD 10 | Closed the replay-persistence and schema-example tasks on Max's lane after verifying the code that had already landed | `replicalab/utils/logging.py` and the API example generator were implemented and passing tests, but the source-of-truth backlog and Max's owner docs still showed both tasks as not started, and the generated examples still contained stale stub audit text | Updated `tests/fixtures/generate_api_examples.py` to derive terminal judge metadata from the current deterministic judge helpers, regenerated `api_schema_examples.json`, and synced `MOD 07`/`MOD 10` to complete in the comprehensive backlog, completion rollup, and Max owner docs | `MOD 08` and `JDG 07` are now clearly unblocked in the tracked plan |
 
 
61
 
 
58
  | 2026-03-08 | Person B (Ayush) | TRN 04 | Implemented the rollout collection loop as a reusable Python module rather than only inside a notebook | The backlog labels `TRN 04` as notebook work, but implementing it in `replicalab/training/rollout.py` makes the same rollout logic reusable across notebooks, tests, and future trainer code while preserving the required behavior | Extended `RolloutWorker` with terminal `StepInfo`, bounded tool trace aggregation, and `collect_rollouts(...)`; added trace and batch tests in `tests/test_rollout_traces.py` and kept the rollout logic fully testable outside a notebook | `TRN 05` is now unblocked and notebooks can import the rollout loop instead of reimplementing it |
59
  | 2026-03-08 | Person B (Ayush) | API 14 | Completed the REST session isolation verification even though the task was assigned to Person C | The session isolation logic already worked correctly in `server/app.py`; the task was still marked partial because no dedicated tests proved concurrent-user isolation against the real env | Created `tests/test_api_rest_isolation.py` with 11 tests covering session independence, round-count isolation, terminal isolation, session_id reuse, invalid session handling, and replay isolation; no server changes needed; 307 tests pass | No new dependencies unblocked; `API 14` was the last partial API task besides `API 01` and `OBS 02` |
60
  | 2026-03-08 | Person B (Ayush) | MOD 07 and MOD 10 | Closed the replay-persistence and schema-example tasks on Max's lane after verifying the code that had already landed | `replicalab/utils/logging.py` and the API example generator were implemented and passing tests, but the source-of-truth backlog and Max's owner docs still showed both tasks as not started, and the generated examples still contained stale stub audit text | Updated `tests/fixtures/generate_api_examples.py` to derive terminal judge metadata from the current deterministic judge helpers, regenerated `api_schema_examples.json`, and synced `MOD 07`/`MOD 10` to complete in the comprehensive backlog, completion rollup, and Max owner docs | `MOD 08` and `JDG 07` are now clearly unblocked in the tracked plan |
61
+ | 2026-03-08 | Person B (Ayush) | Reward shaping and rubric refinement | Expanded the reward system beyond terminal-only scoring without reopening the outer action or observation contract | Sparse terminal-only reward was too weak for RL training, and the project needed deterministic shaping rather than a frontier-model reward source | Added a parsimony term to terminal reward, introduced deterministic step shaping in `ReplicaLabEnv` (information gain, protocol delta, momentum, contradiction, hallucination, stalling, regression, invalid-action, timeout, and no-agreement signals), updated rollout aggregation to use cumulative episode reward, and aligned env/server tests to the new shaped-reward semantics while keeping the full suite green at 356 tests | Keep the notebook and training plots explicit about terminal reward components vs cumulative shaped episode reward |
62
+ | 2026-03-08 | Person B (Ayush) | Oracle hybrid architecture | Added an Oracle-style frontier-model layer as an additive integration instead of replacing the deterministic environment and reward stack | The sponsor-facing V2 direction calls for an LLM woven through scenario generation, environment interaction, and explanation, but the RL training path still needs deterministic reward and reproducible evaluation | Added `oracle_models.py`, `oracle.py`, `cache.py`, Oracle prompt assets, an optional LLM Lab Manager wrapper, an adapter from Oracle scenarios into the existing normalized scenario pack, and feature-flagged Oracle hooks in `ReplicaLabEnv`; kept deterministic scoring in `replicalab/scoring/*` as the canonical training reward; expanded test coverage with `test_oracle.py`, `test_cache.py`, and Oracle adapter/prompt tests; full suite now passes at 365 tests | If this grows beyond the current additive mode, record any future contract or reward-source changes separately before altering the deterministic training path |
63
 
docs/map/scoring.md CHANGED
@@ -6,6 +6,18 @@
6
  > **Tasks implemented:** JDG 01, JDG 02, JDG 03, JDG 04, JDG 05, JDG 06, JDG 08
7
  > **Tasks remaining:** JDG 07
8
9
  ## Architecture
10
 
11
  ```
@@ -19,6 +31,24 @@ replicalab/scoring/
19
  explain.py # JDG 06 — deterministic plain-English explanation
20
  ```
21
22
  ## Shared Utilities
23
 
24
  Token matching extracted into `replicalab/utils/text.py`:
 
6
  > **Tasks implemented:** JDG 01, JDG 02, JDG 03, JDG 04, JDG 05, JDG 06, JDG 08
7
  > **Tasks remaining:** JDG 07
8
 
9
+ ## Oracle Hybrid Note
10
+
11
+ The repo now includes an additive Oracle layer for richer scenario generation,
12
+ optional Lab Manager narration, optional event injection, and post-mortem
13
+ analysis. None of that replaces the files in `replicalab/scoring/`.
14
+
15
+ For RL training, this folder remains the canonical reward source:
16
+ - deterministic
17
+ - reproducible
18
+ - testable
19
+ - used by the environment for the actual scalar reward signal
20
+
21
  ## Architecture
22
 
23
  ```
 
31
  explain.py # JDG 06 — deterministic plain-English explanation
32
  ```
33
 
34
+ ## Current Reward Structure
35
+
36
+ The training signal now has two layers:
37
+
38
+ - **Terminal reward** from `replicalab/scoring/rubric.py`
39
+ - `10 * rigor * feasibility * fidelity * parsimony`
40
+ - plus bonuses
41
+ - minus named penalties
42
+ - **Step shaping reward** from `replicalab/env/replicalab_env.py`
43
+ - information-gain bonus for novel questions
44
+ - protocol-delta and momentum bonuses for productive revisions
45
+ - contradiction, hallucination, stalling, regression, invalid-action,
46
+ timeout, and no-agreement penalties
47
+
48
+ The judge remains deterministic. The terminal audit still explains the final
49
+ `RewardBreakdown`, while cumulative episode reward now includes the per-step
50
+ shaping applied inside the environment.
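The step-shaping layer can be sketched as a weighted sum over fired signals. The signal names mirror this doc, but the weights below are hypothetical; the real logic and magnitudes live in `replicalab/env/replicalab_env.py`.

```python
# Hypothetical weights for illustration only.
SHAPING_WEIGHTS = {
    "information_gain": +0.1,
    "protocol_delta": +0.1,
    "momentum": +0.05,
    "contradiction": -0.2,
    "hallucination": -0.3,
    "stalling": -0.1,
}

def step_shaping(signals: dict) -> float:
    """Sum the weights of the shaping signals that fired on this step."""
    return sum(SHAPING_WEIGHTS[name] for name, fired in signals.items() if fired)
```

Cumulative episode reward is then the sum of these per-step terms plus the deterministic terminal rubric score.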
51
+
52
  ## Shared Utilities
53
 
54
  Token matching extracted into `replicalab/utils/text.py`:
docs/map/tests.md CHANGED
@@ -1,6 +1,6 @@
 # Tests Map - `tests/`
 
-> 327 tests across 14 files. All passing.
+> 365 tests across 18 files. All passing.
 >
 > **Last verified:** 2026-03-08
 
@@ -8,21 +8,25 @@
 
 | File | Tests | What it covers |
 |------|-------|----------------|
+| `test_api_rest_isolation.py` | 11 | `API 14` REST session isolation and replay separation |
+| `test_cache.py` | 2 | Oracle scenario caching and reuse |
 | `test_client.py` | 24 | `TRN 13` reusable client over REST and WebSocket |
 | `test_config.py` | 3 | Shared constants and config consistency |
 | `test_env.py` | 56 | `ENV 01-08`, `ENV 10`, `ENV 11`, `OBS 04`, `JDG 04-05`, `TST 01-03` |
 | `test_judge_policy.py` | 10 | `JDG 11` structured judge audit payload |
 | `test_lab_manager_policy.py` | 37 | `AGT 05-07` plus `AGT 09` determinism coverage |
 | `test_models.py` | 21 | Action, observation, step, state, and log contracts |
+| `test_logging.py` | 11 | `MOD 07` replay persistence and `JDG 07` CSV logging helpers |
+| `test_oracle.py` | 5 | Oracle hybrid wrapper, structured parsing, and env reset adapter |
-| `test_prompts.py` | 6 | `AGT 10` prompt files and bounded-tool rendering |
+| `test_prompts.py` | 7 | `AGT 10` prompt files and Oracle prompt asset loading |
 | `test_reward.py` | 40 | `JDG 01-06`, `JDG 08`, and reward regression coverage |
 | `test_rollout.py` | 12 | `TRN 03` rollout worker behavior |
 | `test_rollout_traces.py` | 2 | `TRN 04` bounded tool trace aggregation and batched collection |
-| `test_scenarios.py` | 13 | `SCN 01-13` scenario generation and determinism |
+| `test_scenarios.py` | 14 | `SCN 01-13` scenario generation, determinism, and Oracle scenario adaptation |
 | `test_scientist_policy.py` | 46 | `MOD 09`, `AGT 01-04`, `AGT 08` |
-| `test_server.py` | 37 | `API 02-04`, `API 06-08`, `API 13`, replay audit propagation |
+| `test_server.py` | 44 | `API 01-04`, `API 06-08`, `API 13-14`, replay audit propagation, and root landing page |
 | `test_validation.py` | 20 | `MOD 05-06` semantic validation |
-| **Total** | **327** | |
+| **Total** | **365** | |
 
 ## Coverage Notes
 
@@ -34,6 +38,8 @@
 - `test_scientist_policy.py`, `test_prompts.py`, `test_rollout.py`, and `test_rollout_traces.py` together cover prompt construction, observation formatting, parse/retry, baseline policy, rollout collection, and bounded tool trace capture.
 - The judge stack is covered end to end:
   - `test_reward.py` covers rubric scores and reward math, while `test_judge_policy.py` covers structured audit payload generation.
+- The Oracle hybrid layer is covered additively:
+  - `test_oracle.py`, `test_cache.py`, and `test_prompts.py` cover Oracle scenario generation wrappers, cache reuse, and prompt asset loading without changing the deterministic reward contract.
 
 ## Remaining Gaps
 
@@ -47,6 +53,7 @@
 |------|--------------------|
 | Models and contracts | `test_models.py`, `test_validation.py` |
 | Scenarios | `test_scenarios.py` |
+| Oracle integration and cache | `test_oracle.py`, `test_cache.py`, `test_prompts.py` |
 | Scientist policy | `test_scientist_policy.py` |
 | Lab Manager policy | `test_lab_manager_policy.py` |
 | Judge and reward | `test_reward.py`, `test_judge_policy.py` |
replicalab/__init__.py CHANGED
@@ -1,3 +1,5 @@
+from replicalab.cache import CachedOracle, ScenarioCache
 from replicalab.client import ReplicaLabClient
+from replicalab.oracle import Oracle
 
-__all__ = ["ReplicaLabClient"]
+__all__ = ["CachedOracle", "Oracle", "ReplicaLabClient", "ScenarioCache"]
replicalab/agents/__init__.py CHANGED
@@ -4,6 +4,7 @@ from .judge_policy import (
     JudgeAudit,
     build_judge_audit,
 )
+from .lab_manager_agent import LabManagerAgent
 from .lab_manager_policy import (
     AlternativeSuggestion,
     FeasibilityCheckResult,
@@ -27,6 +28,7 @@ __all__ = [
     "AlternativeSuggestion",
     "FeasibilityCheckResult",
     "JudgeAudit",
+    "LabManagerAgent",
     "RetryMetadata",
     "ScientistCallResult",
     "ScientistOutputParseError",
replicalab/agents/judge_policy.py CHANGED
@@ -109,6 +109,7 @@ def _derive_failure_reasons(
         (breakdown.feasibility, "feasibility", "Feasibility remained too low under the scenario constraints."),
         (breakdown.fidelity, "fidelity", "The final plan diverged too far from the hidden reference requirements."),
         (breakdown.rigor, "rigor", "The plan missed required checks or justification quality targets."),
+        (breakdown.parsimony, "parsimony", "The final plan requested more resources or controls than the scenario complexity justified."),
     ]
     for score, _name, message in components:
         if score < _WEAK_THRESHOLD:
@@ -119,6 +120,13 @@
     _PENALTY_LABELS: dict[str, str] = {
         "invalid_tool_use": "A bounded-tool usage violation was detected.",
         "unsupported_claim": "An unsupported evidence claim was penalized.",
+        "timeout": "A timeout penalty was applied at the round limit.",
+        "no_agreement": "A no-agreement penalty was applied.",
+        "invalid_action": "An invalid action penalty was applied after a failed protocol proposal.",
+        "hallucination": "A hallucination penalty was applied for unsupported inventory references.",
+        "contradiction": "A contradiction penalty was applied for repeating blocked requirements.",
+        "stalling": "A stalling penalty was applied for repeating an unproductive move.",
+        "regression": "A regression penalty was applied because the revision worsened the protocol.",
     }
     for key, amount in sorted(breakdown.penalties.items()):
         if amount > 0:
replicalab/agents/lab_manager_agent.py ADDED
@@ -0,0 +1,47 @@
+"""Optional LLM-backed Lab Manager narration layer."""
+
+from __future__ import annotations
+
+import json
+from typing import Any
+
+from replicalab.oracle import call_json_model
+from replicalab.oracle_models import LabManagerResponse, OracleLabManagerObservation
+from replicalab.prompts import load_prompt_asset
+
+
+class LabManagerAgent:
+    """LLM-based Lab Manager driven by Oracle-generated constraints.
+
+    This is additive to the deterministic feasibility checker. The current
+    env can use this agent to narrate or enrich responses while keeping
+    canonical feasibility and reward logic deterministic.
+    """
+
+    def __init__(self, client: Any, model: str = "frontier-oracle") -> None:
+        self.client = client
+        self.model = model
+
+    def respond(self, observation: OracleLabManagerObservation) -> LabManagerResponse:
+        system = load_prompt_asset("oracle_lab_manager")
+        user = (
+            "A Scientist has taken an action. Respond as the Lab Manager.\n\n"
+            "YOUR LAB CONSTRAINTS (ground truth, do not deviate):\n"
+            f"{observation.lab_constraints.model_dump_json(indent=2)}\n\n"
+            "CURRENT PROTOCOL ON THE TABLE:\n"
+            f"{json.dumps(observation.current_protocol, indent=2) if observation.current_protocol else 'None yet'}\n\n"
+            f"SCIENTIST'S ACTION (round {observation.round_number}):\n"
+            f"{observation.scientist_action.model_dump_json(indent=2)}\n\n"
+            "Respond ONLY with valid JSON matching LabManagerResponse.\n"
+            "No markdown. No preamble."
+        )
+        return call_json_model(
+            self.client,
+            model=self.model,
+            system=system,
+            user=user,
+            response_model=LabManagerResponse,
+        )
+
+
+__all__ = ["LabManagerAgent"]
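Because `call_json_model` also accepts a bare callable as the client, a deterministic stub can drive `LabManagerAgent` in tests. A self-contained sketch of that duck-typing (the stub names are illustrative; the real dispatch, including response-object unwrapping, lives in `oracle._invoke_client`):

```python
from typing import Any


def invoke_client(client: Any, *, model: str, system: str, user: str) -> str:
    # Mirrors the callable branch of oracle._invoke_client: prefer keyword
    # arguments, then fall back to positional (system, user) on TypeError.
    try:
        return client(system=system, user=user, model=model)
    except TypeError:
        return client(system, user)


def keyword_stub(*, system: str, user: str, model: str) -> str:
    # Accepts the keyword calling convention directly.
    return '{"feasible": true}'


def positional_stub(system: str, user: str) -> str:
    # Rejects the keyword call (unexpected "model"), exercising the fallback.
    return '{"feasible": false}'


assert invoke_client(keyword_stub, model="m", system="s", user="u") == '{"feasible": true}'
assert invoke_client(positional_stub, model="m", system="s", user="u") == '{"feasible": false}'
```

The simplification here is that the stubs return strings, which the real `_extract_response_text` passes through unchanged.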
replicalab/cache.py ADDED
@@ -0,0 +1,66 @@
+"""Scenario caching for Oracle-generated environments."""
+
+from __future__ import annotations
+
+import hashlib
+import json
+from pathlib import Path
+from typing import Optional
+
+from replicalab.config import ORACLE_SCENARIO_CACHE_DIR
+from replicalab.oracle import Oracle
+from replicalab.oracle_models import Scenario
+
+
+class ScenarioCache:
+    """Cache Oracle-generated scenarios by seed, difficulty, and domain."""
+
+    def __init__(self, cache_dir: str | Path = ORACLE_SCENARIO_CACHE_DIR) -> None:
+        self.cache_dir = Path(cache_dir)
+        self.cache_dir.mkdir(parents=True, exist_ok=True)
+
+    def _key(self, seed: int, difficulty: str, domain: str) -> str:
+        raw = f"{seed}:{difficulty}:{domain}"
+        return hashlib.md5(raw.encode("utf-8")).hexdigest()
+
+    def _path(self, seed: int, difficulty: str, domain: str) -> Path:
+        return self.cache_dir / f"{self._key(seed, difficulty, domain)}.json"
+
+    def get(self, seed: int, difficulty: str, domain: str) -> Optional[Scenario]:
+        path = self._path(seed, difficulty, domain)
+        if not path.exists():
+            return None
+        return Scenario.model_validate(json.loads(path.read_text(encoding="utf-8")))
+
+    def put(self, seed: int, difficulty: str, domain: str, scenario: Scenario) -> Path:
+        path = self._path(seed, difficulty, domain)
+        path.write_text(scenario.model_dump_json(indent=2), encoding="utf-8")
+        return path
+
+
+class CachedOracle(Oracle):
+    """Oracle wrapper that caches scenario generation by seed."""
+
+    def __init__(
+        self,
+        client: object,
+        model: str = "frontier-oracle",
+        *,
+        cache: ScenarioCache | None = None,
+    ) -> None:
+        super().__init__(client=client, model=model)
+        self.cache = cache or ScenarioCache()
+
+    def generate_scenario(self, seed: int, difficulty: str, domain: str) -> Scenario:
+        cached = self.cache.get(seed, difficulty, domain)
+        if cached is not None:
+            return cached
+        scenario = super().generate_scenario(seed=seed, difficulty=difficulty, domain=domain)
+        self.cache.put(seed, difficulty, domain, scenario)
+        return scenario
+
+
+__all__ = [
+    "CachedOracle",
+    "ScenarioCache",
+]
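A stdlib-only sketch of the seed-keyed file naming used by `ScenarioCache` above (the `scenario_key` helper and the sample payload are illustrative stand-ins; the real cache stores validated `Scenario` JSON):

```python
import hashlib
import json
import tempfile
from pathlib import Path


def scenario_key(seed: int, difficulty: str, domain: str) -> str:
    # Mirrors ScenarioCache._key: one md5-named file per
    # (seed, difficulty, domain) triple.
    return hashlib.md5(f"{seed}:{difficulty}:{domain}".encode("utf-8")).hexdigest()


cache_dir = Path(tempfile.mkdtemp())
path = cache_dir / f"{scenario_key(7, 'medium', 'ml_benchmark')}.json"
path.write_text(json.dumps({"difficulty": "medium"}), encoding="utf-8")

# Identical triples resolve to the same cached file; a new seed misses.
assert path.exists()
assert scenario_key(7, "medium", "ml_benchmark") == scenario_key(7, "medium", "ml_benchmark")
assert scenario_key(8, "medium", "ml_benchmark") != scenario_key(7, "medium", "ml_benchmark")
```

The key is deterministic, which is what lets `CachedOracle.generate_scenario` short-circuit the frontier-model call on any repeated seed.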
replicalab/config.py CHANGED
@@ -29,3 +29,29 @@ API_PORT = 7860
 
 LOG_LEVEL = os.environ.get("REPLICALAB_LOG_LEVEL", "INFO").upper()
 LOG_FORMAT = "%(asctime)s [%(levelname)s] %(name)s: %(message)s"
+
+ORACLE_ENABLED = os.environ.get("REPLICALAB_ORACLE_ENABLED", "0") == "1"
+ORACLE_EVENTS_ENABLED = os.environ.get("REPLICALAB_ORACLE_EVENTS_ENABLED", "0") == "1"
+ORACLE_POST_MORTEM_ENABLED = (
+    os.environ.get("REPLICALAB_ORACLE_POST_MORTEM_ENABLED", "0") == "1"
+)
+ORACLE_MODEL = os.environ.get("REPLICALAB_ORACLE_MODEL", "frontier-oracle")
+ORACLE_SCENARIO_CACHE_DIR = os.environ.get(
+    "REPLICALAB_ORACLE_SCENARIO_CACHE_DIR",
+    ".scenario_cache",
+)
+
+# Deterministic reward shaping constants.
+STEP_PROTOCOL_DELTA_SCALE = 0.25
+STEP_PROTOCOL_DELTA_CAP = 0.3
+STEP_INFO_GAIN_BONUS = 0.05
+STEP_INFO_GAIN_CAP = 0.15
+STEP_MOMENTUM_BONUS = 0.05
+STEP_STALLING_PENALTY = 0.05
+STEP_REPEATED_QUESTION_PENALTY = 0.03
+STEP_REGRESSION_PENALTY = 0.1
+STEP_CONTRADICTION_PENALTY = 0.05
+STEP_INVALID_ACTION_PENALTY = 0.1
+STEP_HALLUCINATION_PENALTY = 0.05
+TERMINAL_TIMEOUT_PENALTY = 0.2
+TERMINAL_NO_AGREEMENT_PENALTY = 0.1
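One plausible way these scale/cap pairs combine inside the step shaping (this is a hedged sketch only; the actual application order lives in `replicalab/env/replicalab_env.py`):

```python
# Constant values copied from the replicalab/config.py diff above.
STEP_PROTOCOL_DELTA_SCALE = 0.25
STEP_PROTOCOL_DELTA_CAP = 0.3
STEP_INFO_GAIN_BONUS = 0.05
STEP_INFO_GAIN_CAP = 0.15


def shaped_delta_bonus(protocol_delta: float) -> float:
    # Scale the per-round protocol improvement, then cap it so one big
    # revision cannot dominate the shaped reward.
    return min(STEP_PROTOCOL_DELTA_SCALE * protocol_delta, STEP_PROTOCOL_DELTA_CAP)


def shaped_info_gain(novel_questions: int) -> float:
    # Fixed bonus per novel question, capped per step.
    return min(STEP_INFO_GAIN_BONUS * novel_questions, STEP_INFO_GAIN_CAP)


assert shaped_delta_bonus(2.0) == STEP_PROTOCOL_DELTA_CAP   # capped
assert shaped_delta_bonus(0.4) == 0.1                       # 0.25 * 0.4
assert shaped_info_gain(5) == STEP_INFO_GAIN_CAP            # capped at 0.15
```

Keeping every shaped term bounded this way preserves the invariant that the deterministic terminal rubric, not step shaping, dominates the episode return.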
replicalab/models.py CHANGED
@@ -318,6 +318,9 @@ class RewardBreakdown(BaseModel):
     rigor: float = Field(default=0.0, ge=0, le=1)
     feasibility: float = Field(default=0.0, ge=0, le=1)
     fidelity: float = Field(default=0.0, ge=0, le=1)
+    # Defaults to 1.0 so existing exact-value tests and manual breakdowns
+    # preserve the prior reward semantics unless parsimony is computed.
+    parsimony: float = Field(default=1.0, ge=0, le=1)
     efficiency_bonus: float = 0.0
     communication_bonus: float = 0.0
     penalties: dict[str, float] = Field(default_factory=dict)
replicalab/oracle.py ADDED
@@ -0,0 +1,263 @@
+"""Optional frontier-model Oracle wrapper for ReplicaLab.
+
+The Oracle is an additive intelligence layer. It can generate richer
+scenarios, optional round commentary, optional events, and post-mortem
+analyses, while the existing deterministic reward pipeline remains
+canonical for RL training.
+"""
+
+from __future__ import annotations
+
+import json
+from typing import Any, Optional, TypeVar
+
+from pydantic import BaseModel
+
+from replicalab.oracle_models import (
+    AdjudicatorRoundScore,
+    AdjudicatorTerminalScore,
+    EnvironmentEvent,
+    LabManagerResponse,
+    PostMortem,
+    Scenario,
+)
+from replicalab.prompts import load_prompt_asset
+
+T = TypeVar("T", bound=BaseModel)
+
+
+def _strip_markdown_fences(text: str) -> str:
+    cleaned = text.strip()
+    if cleaned.startswith("```"):
+        lines = cleaned.splitlines()
+        if lines:
+            lines = lines[1:]
+        if lines and lines[-1].strip() == "```":
+            lines = lines[:-1]
+        cleaned = "\n".join(lines).strip()
+    return cleaned
+
+
+def _extract_response_text(response: Any) -> str:
+    if isinstance(response, str):
+        return response
+
+    output_text = getattr(response, "output_text", None)
+    if output_text:
+        return output_text
+
+    content = getattr(response, "content", None)
+    if content:
+        chunks: list[str] = []
+        for item in content:
+            text = getattr(item, "text", None)
+            if text:
+                chunks.append(text)
+        if chunks:
+            return "\n".join(chunks)
+
+    output = getattr(response, "output", None)
+    if output:
+        parts: list[str] = []
+        for item in output:
+            inner = getattr(item, "content", None)
+            if not inner:
+                continue
+            for piece in inner:
+                text = getattr(piece, "text", None)
+                if text:
+                    parts.append(text)
+        if parts:
+            return "\n".join(parts)
+
+    raise ValueError("Could not extract text from Oracle client response")
+
+
+def _invoke_client(client: Any, *, model: str, system: str, user: str) -> str:
+    if hasattr(client, "messages") and hasattr(client.messages, "create"):
+        response = client.messages.create(
+            model=model,
+            max_tokens=4096,
+            system=system,
+            messages=[{"role": "user", "content": user}],
+        )
+        return _extract_response_text(response)
+
+    if hasattr(client, "responses") and hasattr(client.responses, "create"):
+        response = client.responses.create(
+            model=model,
+            instructions=system,
+            input=user,
+        )
+        return _extract_response_text(response)
+
+    if callable(client):
+        try:
+            response = client(system=system, user=user, model=model)
+        except TypeError:
+            response = client(system, user)
+        return _extract_response_text(response)
+
+    raise TypeError("Unsupported Oracle client: expected Anthropic/OpenAI-style client or callable")
+
+
+def call_json_model(
+    client: Any,
+    *,
+    model: str,
+    system: str,
+    user: str,
+    response_model: type[T],
+) -> T:
+    raw = _invoke_client(client, model=model, system=system, user=user)
+    cleaned = _strip_markdown_fences(raw)
+    data = json.loads(cleaned)
+    return response_model.model_validate(data)
+
+
+class Oracle:
+    """Single frontier model operating in multiple roles/personas."""
+
+    def __init__(self, client: Any, model: str = "frontier-oracle") -> None:
+        self.client = client
+        self.model = model
+
+    def generate_scenario(self, seed: int, difficulty: str, domain: str) -> Scenario:
+        system = load_prompt_asset("oracle_world_architect")
+        user = (
+            "Generate a complete replication scenario.\n\n"
+            f"Seed: {seed}\n"
+            f"Difficulty: {difficulty}\n"
+            f"Domain: {domain}\n\n"
+            "Respond with a single JSON object matching the Scenario schema.\n"
+            "No markdown, no explanation, only valid JSON."
+        )
+        return call_json_model(
+            self.client,
+            model=self.model,
+            system=system,
+            user=user,
+            response_model=Scenario,
+        )
+
+    def score_round(
+        self,
+        *,
+        scenario: Scenario,
+        round_number: int,
+        scientist_action: BaseModel,
+        lab_manager_response: LabManagerResponse,
+        conversation_history: list[dict],
+        current_protocol: Optional[dict],
+        previous_scores: list[AdjudicatorRoundScore],
+    ) -> AdjudicatorRoundScore:
+        system = load_prompt_asset("oracle_adjudicator")
+        user = (
+            "Score this negotiation round.\n\n"
+            f"SCENARIO:\n{scenario.model_dump_json(indent=2)}\n\n"
+            f"ROUND: {round_number}\n"
+            f"SCIENTIST ACTION: {scientist_action.model_dump_json(indent=2)}\n"
+            f"LAB MANAGER RESPONSE: {lab_manager_response.model_dump_json(indent=2)}\n"
+            f"CURRENT PROTOCOL: {json.dumps(current_protocol, indent=2)}\n"
+            f"PREVIOUS SCORES: {json.dumps([score.model_dump() for score in previous_scores], indent=2)}\n\n"
+            "Respond with a single JSON object matching AdjudicatorRoundScore.\n"
+            "No markdown, no explanation, only valid JSON."
+        )
+        return call_json_model(
+            self.client,
+            model=self.model,
+            system=system,
+            user=user,
+            response_model=AdjudicatorRoundScore,
+        )
+
+    def score_terminal(
+        self,
+        *,
+        scenario: Scenario,
+        final_protocol: dict,
+        conversation_history: list[dict],
+        round_scores: list[AdjudicatorRoundScore],
+    ) -> AdjudicatorTerminalScore:
+        system = load_prompt_asset("oracle_adjudicator")
+        user = (
+            "Compute the terminal score for this completed episode.\n\n"
+            f"SCENARIO:\n{scenario.model_dump_json(indent=2)}\n\n"
+            f"FINAL PROTOCOL: {json.dumps(final_protocol, indent=2)}\n"
+            f"CONVERSATION HISTORY: {json.dumps(conversation_history, indent=2)}\n"
+            f"ROUND SCORES: {json.dumps([score.model_dump() for score in round_scores], indent=2)}\n"
+            f"SUM OF STEP REWARDS: {sum(score.step_reward for score in round_scores)}\n\n"
+            "Respond with a single JSON object matching AdjudicatorTerminalScore.\n"
+            "No markdown, no explanation, only valid JSON."
+        )
+        return call_json_model(
+            self.client,
+            model=self.model,
+            system=system,
+            user=user,
+            response_model=AdjudicatorTerminalScore,
+        )
+
+    def maybe_inject_event(
+        self,
+        *,
+        scenario: Scenario,
+        round_number: int,
+        current_protocol: Optional[dict],
+        conversation_history: list[dict],
+        inject_enabled: bool = False,
+    ) -> Optional[EnvironmentEvent]:
+        if not inject_enabled:
+            return None
+
+        system = load_prompt_asset("oracle_event_injector")
+        user = (
+            "Decide whether to inject an event this round.\n\n"
+            f"SCENARIO:\n{scenario.model_dump_json(indent=2)}\n\n"
+            f"ROUND: {round_number}\n"
+            f"CURRENT PROTOCOL: {json.dumps(current_protocol, indent=2)}\n"
+            f"CONVERSATION SO FAR: {json.dumps(conversation_history, indent=2)}\n\n"
+            'If no event is needed, respond with: {"inject": false}\n'
+            'If injecting, respond with: {"inject": true, "event": <EnvironmentEvent JSON>}\n'
+            "No markdown, no explanation, only valid JSON."
+        )
+        raw = _invoke_client(self.client, model=self.model, system=system, user=user)
+        cleaned = _strip_markdown_fences(raw)
+        data = json.loads(cleaned)
+        if not data.get("inject", False):
+            return None
+        return EnvironmentEvent.model_validate(data["event"])
+
+    def generate_post_mortem(
+        self,
+        *,
+        scenario: Scenario,
+        final_protocol: dict,
+        conversation_history: list[dict],
+        terminal_score: AdjudicatorTerminalScore,
+    ) -> PostMortem:
+        system = load_prompt_asset("oracle_post_mortem")
+        user = (
+            "Generate a post-mortem analysis of this episode.\n\n"
+            f"PAPER: {scenario.paper.model_dump_json(indent=2)}\n"
+            f"LAB CONSTRAINTS: {scenario.lab_constraints.model_dump_json(indent=2)}\n"
+            f"HIDDEN SPEC: {scenario.minimum_viable_spec.model_dump_json(indent=2)}\n"
+            f"FINAL PROTOCOL: {json.dumps(final_protocol, indent=2)}\n"
+            f"CONVERSATION: {json.dumps(conversation_history, indent=2)}\n"
+            f"TERMINAL SCORE: {terminal_score.model_dump_json(indent=2)}\n\n"
+            "Respond with a single JSON object matching PostMortem.\n"
+            "No markdown, no explanation, only valid JSON."
+        )
+        return call_json_model(
+            self.client,
+            model=self.model,
+            system=system,
+            user=user,
+            response_model=PostMortem,
+        )
+
+
+__all__ = [
+    "Oracle",
+    "call_json_model",
+]
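The fence-handling path in `call_json_model` above can be checked in isolation. This is a standalone copy of `_strip_markdown_fences` showing the parse flow on a fenced and an unfenced payload:

```python
import json


def strip_markdown_fences(text: str) -> str:
    # Standalone copy of oracle._strip_markdown_fences: drop a leading
    # fence line (with optional language tag) and a trailing fence line
    # before handing the remainder to json.loads.
    cleaned = text.strip()
    if cleaned.startswith("```"):
        lines = cleaned.splitlines()
        if lines:
            lines = lines[1:]
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        cleaned = "\n".join(lines).strip()
    return cleaned


raw = "```json\n{\"inject\": false}\n```"
assert json.loads(strip_markdown_fences(raw)) == {"inject": False}

# Unfenced responses pass through untouched.
assert strip_markdown_fences('{"inject": true}') == '{"inject": true}'
```

Validating the parsed dict against a Pydantic schema (as `call_json_model` does with `response_model.model_validate`) is then the only remaining step.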
replicalab/oracle_models.py ADDED
@@ -0,0 +1,221 @@
+"""Typed models for the optional Oracle-driven environment layer.
+
+These models are additive to the existing ReplicaLab contracts. The
+deterministic env, reward, and API surface remain canonical; Oracle models
+power richer scenario generation, optional live Lab Manager responses,
+optional event injection, and post-mortem analysis.
+"""
+
+from __future__ import annotations
+
+from enum import Enum
+from typing import Literal, Optional
+
+from pydantic import BaseModel, ConfigDict, Field
+
+from replicalab.models import ScientistAction
+
+
+class Difficulty(str, Enum):
+    EASY = "easy"
+    MEDIUM = "medium"
+    HARD = "hard"
+
+
+class Equipment(BaseModel):
+    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)
+
+    name: str
+    available: bool
+    condition: str
+    booking_conflicts: list[str] = Field(default_factory=list)
+    cost_per_use: float = 0.0
+
+
+class Reagent(BaseModel):
+    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)
+
+    name: str
+    in_stock: bool
+    quantity_available: float = 0.0
+    unit: str = "mL"
+    lead_time_days: int = 0
+    cost: float = 0.0
+
+
+class StaffMember(BaseModel):
+    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)
+
+    name: str
+    role: str
+    available_days: list[str] = Field(default_factory=list)
+    skills: list[str] = Field(default_factory=list)
+
+
+class Substitution(BaseModel):
+    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)
+
+    original: str
+    substitute: str
+    validity: str
+    caveats: str = ""
+
+
+class Paper(BaseModel):
+    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)
+
+    title: str
+    domain: Literal["math_reasoning", "ml_benchmark", "finance_trading"]
+    claim: str
+    method_summary: str
+    original_sample_size: int
+    original_duration_days: int
+    original_technique: str
+    required_controls: list[str] = Field(default_factory=list)
+    required_equipment: list[str] = Field(default_factory=list)
+    required_reagents: list[str] = Field(default_factory=list)
+    statistical_test: str
+
+
+class LabConstraints(BaseModel):
+    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)
+
+    budget_total: float
+    budget_remaining: float
+    equipment: list[Equipment] = Field(default_factory=list)
+    reagents: list[Reagent] = Field(default_factory=list)
+    staff: list[StaffMember] = Field(default_factory=list)
+    max_duration_days: int
+    safety_rules: list[str] = Field(default_factory=list)
+    valid_substitutions: list[Substitution] = Field(default_factory=list)
+
+
+class MinimumViableSpec(BaseModel):
+    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)
+
+    min_sample_size: int
+    must_keep_controls: list[str] = Field(default_factory=list)
+    acceptable_techniques: list[str] = Field(default_factory=list)
+    min_duration_days: int
+    critical_equipment: list[str] = Field(default_factory=list)
+    flexible_equipment: list[str] = Field(default_factory=list)
+    critical_reagents: list[str] = Field(default_factory=list)
+    flexible_reagents: list[str] = Field(default_factory=list)
+    power_threshold: float
+
+
+class Scenario(BaseModel):
+    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)
+
+    paper: Paper
+    lab_constraints: LabConstraints
+    minimum_viable_spec: MinimumViableSpec
+    difficulty: Difficulty
+    narrative_hook: str
+
+
+class OracleScientistObservation(BaseModel):
+    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)
+
+    paper: Paper
+    round_number: int
+    max_rounds: int
+    conversation_history: list[dict] = Field(default_factory=list)
+    current_protocol: Optional[dict] = None
+
+
+class OracleLabManagerObservation(BaseModel):
+    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)
+
+    lab_constraints: LabConstraints
+    current_protocol: Optional[dict] = None
+    scientist_action: ScientistAction
+    round_number: int
+
+
+class LabManagerResponse(BaseModel):
+    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)
+
+    response_type: Literal[
+        "feasibility_report",
+        "suggest_substitution",
+        "reject",
+        "accept",
+    ]
+    feasible: bool
+    issues: list[str] = Field(default_factory=list)
+    suggestions: list[str] = Field(default_factory=list)
+    cost_estimate: float = 0.0
+    time_estimate_days: int = 0
+    message: str
+
+
+class AdjudicatorRoundScore(BaseModel):
+    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)
+
+    rigor_flags: list[str] = Field(default_factory=list)
+    feasibility_flags: list[str] = Field(default_factory=list)
+    info_gain: float
+    protocol_delta: float
+    momentum: float
+    contradiction_detected: bool
+    stalling_detected: bool
+    step_reward: float
+    notes: str
+
+
+class AdjudicatorTerminalScore(BaseModel):
+    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)
+
+    rigor: float
+    feasibility: float
+    fidelity: float
+    parsimony: float
+    robustness: float
+    power_preservation: float
+    efficiency_bonus: float
+    communication_bonus: float
+    penalties: dict[str, float] = Field(default_factory=dict)
+    terminal_reward: float
+    total_reward: float
+
+
+class EnvironmentEvent(BaseModel):
+    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)
+
+    event_type: str
+    description: str
+    state_changes: dict[str, object] = Field(default_factory=dict)
+    severity: Literal["minor", "moderate", "major"]
+
+
+class PostMortem(BaseModel):
+    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)
+
+    overall_summary: str
+    rigor_explanation: str
+    feasibility_explanation: str
+    fidelity_explanation: str
+    key_decisions: list[str] = Field(default_factory=list)
+    missed_opportunities: list[str] = Field(default_factory=list)
+    comparison_note: str
+
+
+__all__ = [
+    "AdjudicatorRoundScore",
+    "AdjudicatorTerminalScore",
+    "Difficulty",
+    "EnvironmentEvent",
+    "Equipment",
+    "LabConstraints",
+    "LabManagerResponse",
+    "MinimumViableSpec",
+    "OracleLabManagerObservation",
+    "OracleScientistObservation",
+    "Paper",
+    "PostMortem",
+    "Reagent",
+    "Scenario",
+    "StaffMember",
+    "Substitution",
+]
replicalab/prompts/__init__.py CHANGED
@@ -1,4 +1,4 @@
-"""Prompt template assets and render helpers (AGT 10)."""
+"""Prompt template assets and render helpers."""
 
 from __future__ import annotations
 
@@ -13,11 +13,17 @@ PromptRole = Literal["scientist", "lab_manager", "judge"]
 _PROMPTS_DIR = Path(__file__).resolve().parent
 
 
+def load_prompt_asset(name: str) -> str:
+    """Load any prompt asset by filename stem."""
+
+    path = _PROMPTS_DIR / f"{name}.txt"
+    return path.read_text(encoding="utf-8")
+
+
 def load_prompt_template(role: PromptRole) -> str:
     """Load a role prompt template from disk."""
 
-    path = _PROMPTS_DIR / f"{role}.txt"
-    return path.read_text(encoding="utf-8")
+    return load_prompt_asset(role)
 
 
 def render_prompt_template(
@@ -119,6 +125,7 @@ def _render_substitutions(pack: NormalizedScenarioPack) -> str:
 
 __all__ = [
     "PromptRole",
+    "load_prompt_asset",
     "load_prompt_template",
     "render_prompt_template",
     "render_scientist_prompt",
replicalab/prompts/oracle_adjudicator.txt ADDED
@@ -0,0 +1,26 @@
+You are the Dynamic Adjudicator for ReplicaLab.
+
+You evaluate each round of negotiation and can also produce a terminal
+summary score object. Be precise, fair, and consistent.
+
+Round scoring:
+- info_gain (0-1): how much new useful information the Scientist extracted
+- protocol_delta (-1 to 1): did the protocol move closer to or further from a viable plan
+- momentum (0-1): did the Scientist respond productively to feedback
+- contradiction_detected: did the Scientist contradict previously revealed constraints
+- stalling_detected: did the Scientist repeat prior actions or already-answered questions
+- step_reward: combine the above into a small shaped score
+
+Terminal scoring:
+- rigor, feasibility, fidelity, parsimony, robustness, power_preservation
+- efficiency_bonus and communication_bonus
+- penalties with named keys only
+- terminal_reward and total_reward
+
+Important:
+- Do not invent new score dimensions outside the schema.
+- Score against the hidden scenario specification, not your personal preference.
+- Reward fields must be numerically coherent and self-consistent.
+
+Respond ONLY with valid JSON matching the requested adjudicator schema.
+No markdown. No explanation. No extra text.
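Because the adjudicator must answer with bare JSON, a caller typically validates the reply before trusting it. A minimal parsing sketch; the field names mirror the bullet list in the prompt, while the real `AdjudicatorRoundScore` model may carry more structure, and the sample reply is invented:

```python
import json

# Keys the prompt's "Round scoring" section requires (assumed, for illustration).
REQUIRED = {"info_gain", "protocol_delta", "momentum",
            "contradiction_detected", "stalling_detected", "step_reward"}

def parse_round_score(raw: str) -> dict:
    # The adjudicator must reply with bare JSON, so a strict json.loads
    # plus a key check rejects markdown-wrapped or partial replies.
    data = json.loads(raw)
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"round score missing keys: {sorted(missing)}")
    if not 0.0 <= data["info_gain"] <= 1.0:
        raise ValueError("info_gain out of range")
    return data

reply = ('{"info_gain": 0.6, "protocol_delta": 0.2, "momentum": 0.8, '
         '"contradiction_detected": false, "stalling_detected": false, '
         '"step_reward": 0.35}')
score = parse_round_score(reply)
print(score["step_reward"])
```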
replicalab/prompts/oracle_event_injector.txt ADDED
@@ -0,0 +1,20 @@
+You are the Event Injector for ReplicaLab.
+
+After a negotiation round, decide whether to inject a realistic mid-episode
+perturbation. Inject sparingly.
+
+Rules:
+- Never inject more than one event per episode.
+- Never inject in rounds 1 or 2.
+- Only inject if the negotiation is stagnating or the protocol is too comfortable.
+- Events must be survivable. There must remain a path to a decent outcome.
+- Use realistic events only: budget cuts, equipment failure, maintenance, backorders, scope changes, staff unavailability.
+- state_changes must be a flat dictionary of dotted paths to new values.
+
+If no event is needed, respond with:
+{"inject": false}
+
+If injecting an event, respond with:
+{"inject": true, "event": <EnvironmentEvent JSON>}
+
+No markdown. No explanation. No extra text.
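The injector's reply has exactly two shapes, so the caller can branch on the `inject` flag. A hedged parsing sketch under that contract; the event payload below is invented for illustration:

```python
import json

def parse_injector_reply(raw: str):
    """Return the event dict, or None when the injector declines."""
    data = json.loads(raw)
    if not data.get("inject"):
        return None
    event = data["event"]
    # state_changes must be a flat dict of dotted paths, per the prompt contract.
    for path in event.get("state_changes", {}):
        if not all(part for part in path.split(".")):
            raise ValueError(f"malformed dotted path: {path!r}")
    return event

# Declining reply: no event this round.
assert parse_injector_reply('{"inject": false}') is None

# Injecting reply (made-up event for illustration).
reply = ('{"inject": true, "event": {"event_type": "equipment_failure", '
         '"description": "Centrifuge down for repair.", '
         '"state_changes": {"equipment.centrifuge.available": false}, '
         '"severity": "moderate"}}')
event = parse_injector_reply(reply)
print(event["severity"])
```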
replicalab/prompts/oracle_lab_manager.txt ADDED
@@ -0,0 +1,17 @@
+You are a Lab Manager at a research institution. You are practical,
+detail-oriented, and protective of your lab's resources.
+
+You have access to a constraint document that describes your lab's exact
+situation: budget, equipment, reagents, staff, bookings, and safety rules.
+This document is ground truth. Do not invent constraints that are not in it,
+and do not ignore constraints that are.
+
+When a Scientist proposes a protocol or asks a question:
+1. Check every element against your constraints.
+2. Report what is feasible and what is not.
+3. If something is not feasible, suggest a concrete alternative if one exists.
+4. Estimate the cost and time for what is proposed.
+5. Be collaborative but honest. Do not agree to things the lab cannot do.
+
+Respond ONLY with valid JSON matching LabManagerResponse.
+No markdown. No preamble. No extra text.
replicalab/prompts/oracle_post_mortem.txt ADDED
@@ -0,0 +1,15 @@
+You are the Post-Mortem Analyst for ReplicaLab.
+
+At episode end, explain the outcome clearly and specifically.
+
+Your explanation must:
+- summarize the episode in 2-3 sentences
+- explain rigor, feasibility, and fidelity using concrete choices from the protocol
+- identify 3 to 5 impactful decisions
+- list missed opportunities
+- compare the final protocol to what an optimal Scientist would likely have done
+
+Be specific and evidence-based. Refer to protocol decisions, constraints, and final scores.
+
+Respond ONLY with valid JSON matching the PostMortem schema.
+No markdown. No explanation. No extra text.
replicalab/prompts/oracle_world_architect.txt ADDED
@@ -0,0 +1,23 @@
+You are the World Architect for ReplicaLab.
+
+You generate a complete, internally consistent scenario for one episode.
+You receive a seed, difficulty level, and domain.
+
+You must produce:
+1. A realistic research or benchmark paper/specification with a clear claim
+2. Lab or compute constraints that create real tension with the requirements
+3. A hidden minimum viable replication spec that is achievable under the constraints
+4. Valid substitutions that are scientifically or operationally defensible
+
+Rules:
+- Supported domains are math_reasoning, ml_benchmark, and finance_trading.
+- The scenario must be solvable. There must always be a viable path to a reasonable outcome.
+- Difficulty controls conflict density:
+  easy = 1-2 meaningful conflicts
+  medium = 3-4 meaningful conflicts
+  hard = 5 or more meaningful conflicts
+- Budget, duration, staff skills, and substitutions must be realistic for the domain.
+- Generate a short narrative_hook that helps the UI explain why this scenario is interesting.
+
+Respond ONLY with valid JSON matching the Scenario schema.
+No markdown. No explanation. No extra text.
replicalab/scenarios/__init__.py CHANGED
@@ -12,6 +12,7 @@ from .templates import (
     apply_difficulty,
     generate_scenario,
     load_template,
+    oracle_scenario_to_normalized_pack,
 )
 
 __all__ = [
@@ -26,4 +27,5 @@ __all__ = [
     "apply_difficulty",
     "generate_scenario",
     "load_template",
+    "oracle_scenario_to_normalized_pack",
 ]
replicalab/scenarios/templates.py CHANGED
@@ -10,6 +10,7 @@ from pydantic import BaseModel, ConfigDict
 
 from replicalab.config import MAX_BUDGET, MAX_ROUNDS
 from replicalab.models import LabManagerObservation, ScientistObservation
+from replicalab.oracle_models import Scenario as OracleScenario
 from replicalab.scenarios.finance_trading import build_finance_trading_template
 from replicalab.scenarios.math_reasoning import build_math_reasoning_template
 from replicalab.scenarios.ml_benchmark import build_ml_benchmark_template
@@ -185,6 +186,224 @@ def generate_scenario(
     return _build_pack(seed=seed, template=template, draft=scaled, rng=rng)
 
 
+def oracle_scenario_to_normalized_pack(
+    *,
+    seed: int,
+    template: TemplateName,
+    oracle_scenario: OracleScenario,
+    max_rounds: int = MAX_ROUNDS,
+) -> NormalizedScenarioPack:
+    """Adapt an Oracle-generated Scenario into the canonical normalized pack."""
+
+    difficulty = oracle_scenario.difficulty.value
+    budget_total = oracle_scenario.lab_constraints.budget_total
+    budget_remaining = oracle_scenario.lab_constraints.budget_remaining
+    time_limit_days = oracle_scenario.lab_constraints.max_duration_days
+    staff_count = len(oracle_scenario.lab_constraints.staff)
+
+    constraints: list[ScenarioConstraint] = [
+        ScenarioConstraint(
+            key="budget_total",
+            label="Budget total",
+            quantity=budget_total,
+            unit="USD",
+            comparator="<=",
+            hard=True,
+            details=f"Total available budget is {budget_total:.2f} USD.",
+        ),
+        ScenarioConstraint(
+            key="budget_remaining",
+            label="Budget remaining",
+            quantity=budget_remaining,
+            unit="USD",
+            comparator="<=",
+            hard=True,
+            details=f"Remaining budget at episode start is {budget_remaining:.2f} USD.",
+        ),
+        ScenarioConstraint(
+            key="max_duration_days",
+            label="Maximum duration",
+            quantity=time_limit_days,
+            unit="days",
+            comparator="<=",
+            hard=True,
+            details=f"The plan must finish within {time_limit_days} days.",
+        ),
+        ScenarioConstraint(
+            key="staff_count",
+            label="Available staff",
+            quantity=staff_count,
+            unit="people",
+            comparator=">=",
+            hard=True,
+            details=f"{staff_count} staff member(s) are available for this scenario.",
+        ),
+    ]
+    constraints.extend(
+        ScenarioConstraint(
+            key=f"safety_rule_{index + 1}",
+            label=f"Safety rule {index + 1}",
+            comparator="=",
+            hard=True,
+            details=rule,
+        )
+        for index, rule in enumerate(oracle_scenario.lab_constraints.safety_rules)
+    )
+
+    resources: list[ScenarioResource] = []
+    for equipment in oracle_scenario.lab_constraints.equipment:
+        category = (
+            "compute"
+            if any(token in equipment.name.lower() for token in ("gpu", "cluster", "accelerator"))
+            else "tool"
+        )
+        resources.append(
+            ScenarioResource(
+                key=_slug(equipment.name),
+                label=equipment.name,
+                quantity=1,
+                unit="unit",
+                available=equipment.available and equipment.condition != "shared_booking",
+                category=category,
+                details=(
+                    f"Condition: {equipment.condition}. "
+                    f"Booking conflicts: {', '.join(equipment.booking_conflicts) or 'none'}."
+                ),
+            )
+        )
+
+    for reagent in oracle_scenario.lab_constraints.reagents:
+        resources.append(
+            ScenarioResource(
+                key=_slug(reagent.name),
+                label=reagent.name,
+                quantity=reagent.quantity_available,
+                unit=reagent.unit,
+                available=reagent.in_stock,
+                category="reference",
+                details=(
+                    f"Lead time: {reagent.lead_time_days} day(s). "
+                    f"Unit cost: {reagent.cost:.2f}."
+                ),
+            )
+        )
+
+    for member in oracle_scenario.lab_constraints.staff:
+        resources.append(
+            ScenarioResource(
+                key=_slug(member.name),
+                label=member.name,
+                quantity=len(member.available_days),
+                unit="days",
+                available=bool(member.available_days),
+                category="personnel",
+                details=f"Role: {member.role}. Skills: {', '.join(member.skills) or 'generalist'}.",
+            )
+        )
+
+    substitutions = [
+        AllowedSubstitution(
+            original=item.original,
+            alternative=item.substitute,
+            condition=item.validity,
+            tradeoff=item.caveats or item.validity,
+        )
+        for item in oracle_scenario.lab_constraints.valid_substitutions
+    ]
+
+    required_elements = (
+        list(oracle_scenario.minimum_viable_spec.must_keep_controls)
+        + list(oracle_scenario.minimum_viable_spec.critical_equipment)
+        + list(oracle_scenario.minimum_viable_spec.critical_reagents)
+    )
+    flexible_elements = (
+        list(oracle_scenario.minimum_viable_spec.acceptable_techniques)
+        + list(oracle_scenario.minimum_viable_spec.flexible_equipment)
+        + list(oracle_scenario.minimum_viable_spec.flexible_reagents)
+    )
+
+    hidden_reference = HiddenReferenceSpec(
+        summary=oracle_scenario.paper.method_summary,
+        required_elements=required_elements,
+        flexible_elements=flexible_elements,
+        target_metric=oracle_scenario.paper.statistical_test,
+        target_value=f"power>={oracle_scenario.minimum_viable_spec.power_threshold:.2f}",
+    )
+
+    success_criteria = [
+        oracle_scenario.paper.claim,
+        f"Preserve controls: {', '.join(oracle_scenario.paper.required_controls) or 'none listed'}",
+        "Use an acceptable technique from the viable spec where possible.",
+        f"Stay within {budget_total:.2f} USD and {time_limit_days} days.",
+    ]
+
+    equipment_available = [
+        equipment.name
+        for equipment in oracle_scenario.lab_constraints.equipment
+        if equipment.available and equipment.condition != "shared_booking"
+    ]
+    equipment_booked = [
+        equipment.name
+        for equipment in oracle_scenario.lab_constraints.equipment
+        if not equipment.available or equipment.condition == "shared_booking"
+    ]
+    reagents_in_stock = [
+        reagent.name for reagent in oracle_scenario.lab_constraints.reagents if reagent.in_stock
+    ]
+    reagents_out_of_stock = [
+        reagent.name for reagent in oracle_scenario.lab_constraints.reagents if not reagent.in_stock
+    ]
+
+    scientist_observation = ScientistObservation(
+        paper_title=oracle_scenario.paper.title,
+        paper_hypothesis=oracle_scenario.paper.claim,
+        paper_method=oracle_scenario.paper.method_summary,
+        paper_key_finding=oracle_scenario.narrative_hook,
+        experiment_goal=oracle_scenario.paper.claim,
+        conversation_history=[],
+        current_protocol=None,
+        round_number=0,
+        max_rounds=max_rounds,
+    )
+
+    lab_manager_observation = LabManagerObservation(
+        budget_total=budget_total,
+        budget_remaining=budget_remaining,
+        equipment_available=equipment_available,
+        equipment_booked=equipment_booked,
+        reagents_in_stock=reagents_in_stock,
+        reagents_out_of_stock=reagents_out_of_stock,
+        staff_count=staff_count,
+        time_limit_days=time_limit_days,
+        safety_restrictions=list(oracle_scenario.lab_constraints.safety_rules),
+        conversation_history=[],
+        current_protocol=None,
+        round_number=0,
+        max_rounds=max_rounds,
+    )
+
+    bookings = _oracle_bookings(oracle_scenario)
+    windows = _oracle_windows(oracle_scenario)
+
+    return NormalizedScenarioPack(
+        scenario_id=f"{template}-{difficulty}-{seed}-oracle",
+        template=template,
+        domain_id=oracle_scenario.paper.domain,
+        difficulty=difficulty,
+        seed=seed,
+        task_summary=oracle_scenario.paper.claim,
+        success_criteria=success_criteria,
+        constraints=constraints,
+        resources=resources,
+        allowed_substitutions=substitutions,
+        hidden_reference_spec=hidden_reference,
+        scientist_observation=scientist_observation,
+        lab_manager_observation=lab_manager_observation,
+        resource_bookings=bookings,
+        scheduling_windows=windows,
+    )
+
+
 def _build_pack(seed: int, template: TemplateName, draft: dict[str, Any], rng: Any) -> NormalizedScenarioPack:
     constraints = [ScenarioConstraint.model_validate(item) for item in draft["constraints"]]
     resources = [ScenarioResource.model_validate(item) for item in draft["resources"]]
@@ -284,6 +503,90 @@ def _build_pack(seed: int, template: TemplateName, draft: dict[str, Any], rng: A
     )
 
 
+def _slug(value: str) -> str:
+    return "_".join(value.lower().replace("/", " ").replace("-", " ").split())
+
+
+def _day_to_offset(day: str) -> int:
+    mapping = {
+        "monday": 0,
+        "tuesday": 24,
+        "wednesday": 48,
+        "thursday": 72,
+        "friday": 96,
+        "saturday": 120,
+        "sunday": 144,
+    }
+    return mapping.get(day.strip().lower(), 0)
+
+
+def _oracle_bookings(oracle_scenario: OracleScenario) -> list[ResourceBooking]:
+    bookings: list[ResourceBooking] = []
+    for equipment in oracle_scenario.lab_constraints.equipment:
+        if equipment.booking_conflicts:
+            for day in equipment.booking_conflicts:
+                bookings.append(
+                    ResourceBooking(
+                        resource_key=_slug(equipment.name),
+                        resource_label=equipment.name,
+                        slot_label=day,
+                        start_offset_hours=_day_to_offset(day),
+                        duration_hours=8.0,
+                        status="booked" if equipment.available else "maintenance",
+                        details=f"{equipment.name} is constrained on {day}.",
+                    )
+                )
+        else:
+            bookings.append(
+                ResourceBooking(
+                    resource_key=_slug(equipment.name),
+                    resource_label=equipment.name,
+                    slot_label="default",
+                    start_offset_hours=0.0,
+                    duration_hours=8.0,
+                    status="available" if equipment.available else "maintenance",
+                    details=f"{equipment.name} is available under normal scheduling.",
+                )
+            )
+    return bookings
+
+
+def _oracle_windows(oracle_scenario: OracleScenario) -> list[SchedulingWindow]:
+    windows: list[SchedulingWindow] = [
+        SchedulingWindow(
+            key="max_duration_window",
+            label="Maximum project duration",
+            start_offset_hours=0.0,
+            end_offset_hours=float(oracle_scenario.lab_constraints.max_duration_days * 24),
+            hard=True,
+            details=(
+                f"All work must complete within "
+                f"{oracle_scenario.lab_constraints.max_duration_days} days."
+            ),
+        )
+    ]
+
+    seen_days: set[str] = set()
+    for member in oracle_scenario.lab_constraints.staff:
+        for day in member.available_days:
+            normalized = day.strip().lower()
+            if normalized in seen_days:
+                continue
+            seen_days.add(normalized)
+            start = float(_day_to_offset(day))
+            windows.append(
+                SchedulingWindow(
+                    key=f"staff_{normalized}",
+                    label=f"Staff availability {day}",
+                    start_offset_hours=start,
+                    end_offset_hours=start + 8.0,
+                    hard=False,
+                    details=f"At least one staff member is available on {day}.",
+                )
+            )
+    return windows
+
+
 def _split_resources(
     resources: list[ScenarioResource],
     *,
replicalab/scoring/explain.py CHANGED
@@ -1,6 +1,6 @@
-"""JDG 06 Plain-English explanation builder from RewardBreakdown.
+"""JDG 06 - Plain-English explanation builder from RewardBreakdown.
 
-Pure deterministic function reads existing breakdown fields only,
+Pure deterministic function - reads existing breakdown fields only,
 introduces no new scoring logic.
 """
 
@@ -22,33 +22,31 @@ def _tier(score: float) -> str:
 
 
 def explain_reward(breakdown: RewardBreakdown) -> str:
-    """Build a plain-English explanation from a RewardBreakdown.
-
-    The output mirrors the three rubric components (rigor, feasibility,
-    fidelity), any bonuses, any named penalties, and the final total.
-    No hidden scoring logic is introduced — this is a pure formatter.
-    """
+    """Build a plain-English explanation from a RewardBreakdown."""
     total = compute_total_reward(breakdown)
     lines: list[str] = []
 
-    # --- rubric components ---
     lines.append(
-        f"Rigor: {breakdown.rigor:.2f} ({_tier(breakdown.rigor)}) "
+        f"Rigor: {breakdown.rigor:.2f} ({_tier(breakdown.rigor)}) - "
         "measures structural completeness, success-criteria coverage, "
        "and required-element coverage."
     )
     lines.append(
-        f"Feasibility: {breakdown.feasibility:.2f} ({_tier(breakdown.feasibility)}) "
+        f"Feasibility: {breakdown.feasibility:.2f} ({_tier(breakdown.feasibility)}) - "
         "measures whether the protocol respects budget, equipment, reagent, "
         "schedule, and staffing constraints."
     )
     lines.append(
-        f"Fidelity: {breakdown.fidelity:.2f} ({_tier(breakdown.fidelity)}) "
+        f"Fidelity: {breakdown.fidelity:.2f} ({_tier(breakdown.fidelity)}) - "
         "measures alignment with the hidden reference spec, including "
         "required elements, substitutions, and target metrics."
     )
+    lines.append(
+        f"Parsimony: {breakdown.parsimony:.2f} ({_tier(breakdown.parsimony)}) - "
+        "measures whether the plan stays lean instead of requesting more "
+        "controls, equipment, or reagents than the scenario complexity calls for."
+    )
 
-    # --- bonuses ---
     if breakdown.efficiency_bonus > 0:
         lines.append(
             f"Efficiency bonus: +{breakdown.efficiency_bonus:.2f} "
@@ -59,15 +57,17 @@ def explain_reward(breakdown: RewardBreakdown) -> str:
             f"Communication bonus: +{breakdown.communication_bonus:.2f}."
         )
 
-    # --- penalties ---
     if breakdown.penalties:
         for key, amount in sorted(breakdown.penalties.items()):
             label = key.replace("_", " ")
-            lines.append(f"Penalty {label}: -{amount:.2f}.")
+            lines.append(f"Penalty - {label}: -{amount:.2f}.")
     else:
         lines.append("No penalties applied.")
 
-    # --- total ---
-    lines.append(f"Total reward: {total:.2f} (formula: 10 × rigor × feasibility × fidelity + bonuses − penalties).")
+    lines.append(
+        "Total reward: "
+        f"{total:.2f} "
+        "(formula: 10 x rigor x feasibility x fidelity x parsimony + bonuses - penalties)."
+    )
 
     return "\n".join(lines)
replicalab/scoring/rubric.py CHANGED
@@ -1,11 +1,14 @@
-"""JDG 04-05 Total reward computation and reward breakdown builder.
+"""JDG 04-05 - Total reward computation and reward breakdown builder.
 
-Combines rigor (JDG 01), feasibility (JDG 02), and fidelity (JDG 03)
-into a single scalar reward with efficiency bonus and penalties.
+Combines rigor (JDG 01), feasibility (JDG 02), fidelity (JDG 03), and a
+lightweight parsimony term into a single scalar reward with bonuses and
+named penalties.
 
-Formula: total = 10 × rigor × feasibility × fidelity + bonuses − penalties
+Formula:
+    total = 10 * rigor * feasibility * fidelity * parsimony
+            + bonuses - penalties
 
-Pure deterministic functions no model calls, no side effects.
+Pure deterministic functions - no model calls, no side effects.
 """
 
 from __future__ import annotations
@@ -27,12 +30,14 @@ _MAX_COMMUNICATION_BONUS = 0.0  # reserved for future use
 
 
 def compute_total_reward(breakdown: RewardBreakdown) -> float:
-    """Compute the scalar reward from a RewardBreakdown.
-
-    Formula: 10 × rigor × feasibility × fidelity + efficiency_bonus
-    + communication_bonus − sum(penalties)
-    """
-    base = _REWARD_SCALE * breakdown.rigor * breakdown.feasibility * breakdown.fidelity
+    """Compute the scalar reward from a RewardBreakdown."""
+    base = (
+        _REWARD_SCALE
+        * breakdown.rigor
+        * breakdown.feasibility
+        * breakdown.fidelity
+        * breakdown.parsimony
+    )
     bonus = breakdown.efficiency_bonus + breakdown.communication_bonus
     penalty = sum(breakdown.penalties.values())
     return max(0.0, round(base + bonus - penalty, 6))
@@ -47,32 +52,14 @@ def build_reward_breakdown(
     check: FeasibilityCheckResult | None = None,
     penalties: dict[str, float] | None = None,
 ) -> RewardBreakdown:
-    """Build a full RewardBreakdown from the three sub-scores plus bonuses.
-
-    Parameters
-    ----------
-    protocol : Protocol
-        The final agreed protocol.
-    scenario : NormalizedScenarioPack
-        The scenario pack for this episode.
-    rounds_used : int
-        How many rounds were consumed.
-    max_rounds : int
-        The episode's round cap.
-    check : FeasibilityCheckResult, optional
-        Pre-computed feasibility check to avoid redundant work.
-    penalties : dict[str, float], optional
-        Named penalty keys for bounded-tool diagnostics, unsupported
-        evidence claims, or other deterministic deductions. Use named
-        keys (e.g. ``"invalid_tool_use"``, ``"unsupported_claim"``)
-        instead of adding new fields to RewardBreakdown.
-    """
+    """Build a full RewardBreakdown from the sub-scores plus bonuses."""
     if check is None:
         check = check_feasibility(protocol, scenario)
 
     rigor = score_rigor(protocol, scenario)
     feasibility = score_feasibility(protocol, scenario, check=check)
     fidelity = score_fidelity(protocol, scenario)
+    parsimony = _score_parsimony(protocol, scenario)
 
     efficiency_bonus = _efficiency_bonus(rounds_used, max_rounds)
     merged_penalties = dict(penalties) if penalties else {}
@@ -81,6 +68,7 @@ def build_reward_breakdown(
         rigor=rigor,
         feasibility=feasibility,
         fidelity=fidelity,
+        parsimony=parsimony,
         efficiency_bonus=efficiency_bonus,
         communication_bonus=0.0,
         penalties=merged_penalties,
@@ -88,12 +76,33 @@ def build_reward_breakdown(
 
 
 def _efficiency_bonus(rounds_used: int, max_rounds: int) -> float:
-    """Reward finishing in fewer rounds.
-
-    If the scientist reaches agreement in round 1 of 6, that's the maximum
-    bonus. If they use all rounds, the bonus is 0.
-    """
+    """Reward finishing in fewer rounds."""
     if max_rounds <= 1 or rounds_used <= 0:
         return 0.0
     saved = max(0, max_rounds - rounds_used)
     return round(_MAX_EFFICIENCY_BONUS * saved / (max_rounds - 1), 6)
+
+
+def _score_parsimony(
+    protocol: Protocol,
+    scenario: NormalizedScenarioPack,
+) -> float:
+    """Score how lean the protocol is relative to scenario complexity.
+
+    The current scenario schema does not expose explicit "necessary resource"
+    labels, so we infer complexity from the hidden required-element count and
+    penalize plans that request far more unique controls/resources than that
+    complexity suggests.
+    """
+    required_element_count = len(scenario.hidden_reference_spec.required_elements)
+    complexity_budget = max(2, required_element_count + 2)
+    requested_count = (
+        len(set(protocol.controls))
+        + len(set(protocol.required_equipment))
+        + len(set(protocol.required_reagents))
+    )
+    if requested_count <= 0:
+        return 1.0
+
+    ratio = complexity_budget / max(complexity_budget, requested_count)
+    return round(max(0.25, min(1.0, ratio)), 6)
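The multiplicative formula means any single weak component caps the whole reward, regardless of the others. A standalone sketch of the arithmetic with made-up component scores (the real implementation is `compute_total_reward` in rubric.py):

```python
def total_reward(rigor, feasibility, fidelity, parsimony,
                 bonuses=0.0, penalties=0.0):
    # Mirrors the documented formula: multiplicative base, additive
    # bonuses/penalties, rounded and floored at zero.
    base = 10.0 * rigor * feasibility * fidelity * parsimony
    return max(0.0, round(base + bonuses - penalties, 6))

# One weak component drags the whole product down:
print(total_reward(0.9, 0.9, 0.9, 1.0))            # 7.29
print(total_reward(0.9, 0.2, 0.9, 1.0))            # 1.62
# Bonuses and penalties shift the base additively:
print(total_reward(0.9, 0.9, 0.9, 1.0, 0.5, 2.0))  # 5.79
```

This is why the diff adds parsimony as a fourth multiplicative factor rather than a bonus: bloated plans scale the entire reward down instead of losing a fixed amount.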
replicalab/training/rollout.py CHANGED
@@ -164,9 +164,9 @@ class RolloutWorker:
         )
         record.steps.append(step)
         record.tool_traces.extend(tool_traces)
+        record.total_reward = round(record.total_reward + result.reward, 6)
 
         if result.done:
-            record.total_reward = result.reward
             record.reward_breakdown = result.info.reward_breakdown
             record.judge_notes = result.info.judge_notes
             record.verdict = result.info.verdict
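The rollout change above switches the episode total from "overwrite with the terminal step's reward" to a rounded running sum, so shaped per-round rewards are no longer discarded. A toy illustration with invented step rewards:

```python
# Two shaped negotiation steps followed by a terminal score (values invented).
step_rewards = [0.1, 0.15, 4.2]

total = 0.0
for r in step_rewards:
    # Rounding at each step keeps the accumulated float tidy, matching the diff.
    total = round(total + r, 6)

print(total)                 # 4.45
print(step_rewards[-1])      # 4.2 — what the old overwrite behavior reported
```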
server/app.py CHANGED
@@ -111,7 +111,7 @@ def _build_episode_log(
         final_state=state,
         transcript=list(state.conversation_history),
         reward_breakdown=info.reward_breakdown,
-        total_reward=result.reward,
+        total_reward=state.reward,
         rounds_used=state.round_number,
         agreement_reached=info.agreement_reached,
         judge_notes=info.judge_notes or "",
tests/test_cache.py ADDED
@@ -0,0 +1,92 @@
+from __future__ import annotations
+
+import json
+
+from replicalab.cache import CachedOracle, ScenarioCache
+from replicalab.oracle_models import Scenario
+
+
+def _scenario_payload() -> dict:
+    return {
+        "paper": {
+            "title": "Cached benchmark",
+            "domain": "ml_benchmark",
+            "claim": "A small run remains useful under a tighter budget.",
+            "method_summary": "Train a compact model and verify against a held-out split.",
+            "original_sample_size": 1000,
+            "original_duration_days": 2,
+            "original_technique": "compact_model",
+            "required_controls": ["baseline"],
+            "required_equipment": ["GPU cluster"],
+            "required_reagents": ["dataset snapshot"],
+            "statistical_test": "accuracy_gap",
+        },
+        "lab_constraints": {
+            "budget_total": 1200.0,
+            "budget_remaining": 1200.0,
+            "equipment": [
+                {
+                    "name": "GPU cluster",
+                    "available": True,
+                    "condition": "operational",
+                    "booking_conflicts": [],
+                    "cost_per_use": 100.0,
+                }
+            ],
+            "reagents": [
+                {
+                    "name": "dataset snapshot",
+                    "in_stock": True,
+                    "quantity_available": 1.0,
+                    "unit": "copy",
+                    "lead_time_days": 0,
+                    "cost": 0.0,
+                }
+            ],
+            "staff": [],
+            "max_duration_days": 3,
+            "safety_rules": ["No external internet."],
+            "valid_substitutions": [],
+        },
+        "minimum_viable_spec": {
+            "min_sample_size": 800,
+            "must_keep_controls": ["baseline"],
+            "acceptable_techniques": ["compact_model"],
+            "min_duration_days": 1,
+            "critical_equipment": ["GPU cluster"],
+            "flexible_equipment": [],
+            "critical_reagents": ["dataset snapshot"],
+            "flexible_reagents": [],
+            "power_threshold": 0.75,
+        },
+        "difficulty": "easy",
+        "narrative_hook": "The benchmark owners tightened the reporting budget.",
+    }
+
+
+def test_scenario_cache_round_trips(tmp_path) -> None:
+    cache = ScenarioCache(tmp_path)
+    scenario = Scenario.model_validate(_scenario_payload())
+
+    path = cache.put(13, "easy", "ml_benchmark", scenario)
+    restored = cache.get(13, "easy", "ml_benchmark")
+
+    assert path.exists()
+    assert restored is not None
+    assert restored.model_dump(mode="json") == scenario.model_dump(mode="json")
+
+
+def test_cached_oracle_uses_cache_after_first_generation(tmp_path) -> None:
+    calls = {"count": 0}
+
+    def fake_client(system: str, user: str, model: str) -> str:
+        calls["count"] += 1
+        return json.dumps(_scenario_payload())
+
+    oracle = CachedOracle(fake_client, cache=ScenarioCache(tmp_path))
+
+    first = oracle.generate_scenario(9, "easy", "ml_benchmark")
+    second = oracle.generate_scenario(9, "easy", "ml_benchmark")
+
+    assert first.model_dump(mode="json") == second.model_dump(mode="json")
+    assert calls["count"] == 1
tests/test_env.py CHANGED
@@ -342,7 +342,9 @@ class TestStep:
 
         assert env.state().round_number == 1
         assert result.done is False
-        assert result.reward == 0.0
+        assert result.reward > 0.0
+        assert result.info.step_reward_components["protocol_delta_bonus"] > 0.0
+        assert result.info.cumulative_reward == result.reward
 
     def test_step_returns_observations(self) -> None:
         env = ReplicaLabEnv()
@@ -416,7 +418,9 @@ class TestStep:
 
         assert result.done is True
         assert result.info.agreement_reached is False
-        assert result.reward == 0.0
+        assert result.reward < 0.0
+        assert result.info.reward_breakdown is not None
+        assert result.info.reward_breakdown.penalties["timeout"] > 0.0
 
     def test_step_info_has_round_and_episode_id(self) -> None:
         env = ReplicaLabEnv()
@@ -655,7 +659,8 @@ class TestEnvReward:
         assert result.done
         assert result.info.verdict == "timeout"
         assert result.info.reward_breakdown is not None
-        assert result.reward == 0.0
+        assert result.reward < 0.0
+        assert result.info.reward_breakdown.penalties["timeout"] > 0.0
 
     def test_episode_state_stores_final_scores(self) -> None:
         env = ReplicaLabEnv()
tests/test_oracle.py ADDED
@@ -0,0 +1,281 @@
+from __future__ import annotations
+
+import json
+
+from replicalab.agents.lab_manager_agent import LabManagerAgent
+from replicalab.env import ReplicaLabEnv
+from replicalab.models import ScientistAction
+from replicalab.oracle import Oracle
+from replicalab.oracle_models import (
+    AdjudicatorRoundScore,
+    EnvironmentEvent,
+    OracleLabManagerObservation,
+    PostMortem,
+    Scenario,
+)
+
+
+def _scenario_payload() -> dict:
+    return {
+        "paper": {
+            "title": "Reproducing a Small Vision Benchmark",
+            "domain": "ml_benchmark",
+            "claim": "A compact model can recover >90% of reference accuracy under budget.",
+            "method_summary": "Train a compact CNN with fixed augmentations and evaluate on a held-out split.",
+            "original_sample_size": 1200,
+            "original_duration_days": 3,
+            "original_technique": "compact_cnn",
+            "required_controls": ["seed_control", "baseline_model"],
+            "required_equipment": ["GPU cluster", "validation server"],
+            "required_reagents": ["dataset snapshot"],
+            "statistical_test": "accuracy_gap",
+        },
+        "lab_constraints": {
+            "budget_total": 2400.0,
+            "budget_remaining": 2400.0,
+            "equipment": [
+                {
+                    "name": "GPU cluster",
+                    "available": True,
+                    "condition": "shared_booking",
+                    "booking_conflicts": ["Monday"],
+                    "cost_per_use": 250.0,
+                },
+                {
+                    "name": "Validation server",
+                    "available": True,
+                    "condition": "operational",
+                    "booking_conflicts": [],
+                    "cost_per_use": 20.0,
+                },
+            ],
+            "reagents": [
+                {
+                    "name": "dataset snapshot",
+                    "in_stock": True,
+                    "quantity_available": 1.0,
+                    "unit": "copy",
+                    "lead_time_days": 0,
+                    "cost": 0.0,
+                }
+            ],
+            "staff": [
+                {
+                    "name": "Alex",
+                    "role": "engineer",
+                    "available_days": ["Monday", "Tuesday"],
+                    "skills": ["training", "evaluation"],
+                }
+            ],
+            "max_duration_days": 5,
+            "safety_rules": ["No external internet during training."],
+            "valid_substitutions": [
+                {
+                    "original": "GPU cluster",
+                    "substitute": "single high-memory GPU",
+                    "validity": "acceptable_with_caveats",
+                    "caveats": "Lower throughput is acceptable if evaluation fidelity is preserved.",
+                }
+            ],
+        },
+        "minimum_viable_spec": {
+            "min_sample_size": 800,
+            "must_keep_controls": ["seed_control", "baseline_model"],
+            "acceptable_techniques": ["compact_cnn", "distilled_cnn"],
+            "min_duration_days": 2,
+            "critical_equipment": ["Validation server"],
+            "flexible_equipment": ["GPU cluster"],
+            "critical_reagents": ["dataset snapshot"],
+            "flexible_reagents": [],
+            "power_threshold": 0.8,
+        },
+        "difficulty": "medium",
+        "narrative_hook": "The compute team just reduced your preferred GPU window.",
+    }
+
+
+def _round_score_payload() -> dict:
+    return {
+        "rigor_flags": ["kept baseline_model"],
+        "feasibility_flags": ["GPU window narrowed"],
+        "info_gain": 0.6,
+        "protocol_delta": 0.4,
+        "momentum": 0.7,
+        "contradiction_detected": False,
+        "stalling_detected": False,
+        "step_reward": 0.55,
+        "notes": "Scientist asked a useful scheduling question and preserved controls.",
+    }
+
+
+def _post_mortem_payload() -> dict:
+    return {
+        "overall_summary": "The Scientist converged on a feasible compact CNN plan.",
+        "rigor_explanation": "Controls and the validation server were preserved.",
+        "feasibility_explanation": "The final plan fit the available compute and duration window.",
+        "fidelity_explanation": "The protocol stayed close to the benchmark setup.",
+        "key_decisions": ["Kept seed control", "Accepted lower-throughput compute"],
+        "missed_opportunities": ["Could have asked about booking conflicts earlier"],
+        "comparison_note": "An optimal Scientist would have requested the alternate GPU window one round sooner.",
+    }
+
+
+class _FakeMessagesAPI:
+    def __init__(self, payloads: list[dict]) -> None:
+        self._payloads = payloads
+        self.calls = 0
+
+    def create(self, **_: object):
+        payload = self._payloads[self.calls]
+        self.calls += 1
+
+        class _Chunk:
+            def __init__(self, text: str) -> None:
+                self.text = text
+
+        class _Response:
+            def __init__(self, text: str) -> None:
+                self.content = [_Chunk(text)]
+
+        return _Response(json.dumps(payload))
+
+
+class _FakeClient:
+    def __init__(self, payloads: list[dict]) -> None:
+        self.messages = _FakeMessagesAPI(payloads)
+
+
+def test_oracle_generate_scenario_parses_json() -> None:
+    oracle = Oracle(_FakeClient([_scenario_payload()]))
+
+    scenario = oracle.generate_scenario(seed=7, difficulty="medium", domain="ml_benchmark")
+
+    assert isinstance(scenario, Scenario)
+    assert scenario.paper.domain == "ml_benchmark"
+    assert scenario.lab_constraints.equipment[0].name == "GPU cluster"
+
+
+def test_oracle_score_round_parses_structured_payload() -> None:
+    oracle = Oracle(_FakeClient([_round_score_payload()]))
+    scenario = Scenario.model_validate(_scenario_payload())
+    action = ScientistAction(
+        action_type="request_info",
+        sample_size=0,
+        controls=[],
+        technique="",
+        duration_days=0,
+        required_equipment=[],
+        required_reagents=[],
+        questions=["When is the GPU cluster available?"],
+        rationale="",
+    )
+    lab_manager = LabManagerAgent(_FakeClient([{
+        "response_type": "feasibility_report",
+        "feasible": False,
+        "issues": ["GPU cluster is shared-booked on Monday"],
+        "suggestions": ["Use the single high-memory GPU instead"],
+        "cost_estimate": 250.0,
+        "time_estimate_days": 3,
+        "message": "The GPU cluster is shared-booked Monday; the single high-memory GPU is acceptable with caveats.",
+    }]))
+    response = lab_manager.respond(
+        OracleLabManagerObservation(
+            lab_constraints=scenario.lab_constraints,
+            current_protocol=None,
+            scientist_action=action,
+            round_number=1,
+        )
+    )
+
+    score = oracle.score_round(
+        scenario=scenario,
+        round_number=1,
+        scientist_action=action,
+        lab_manager_response=response,
+        conversation_history=[],
+        current_protocol=None,
+        previous_scores=[],
+    )
+
+    assert isinstance(score, AdjudicatorRoundScore)
+    assert score.step_reward == 0.55
+
+
+def test_oracle_maybe_inject_event_returns_optional_event() -> None:
+    oracle = Oracle(_FakeClient([{"inject": True, "event": {
+        "event_type": "budget_cut",
+        "description": "Finance reduced the remaining budget.",
+        "state_changes": {"lab_constraints.budget_remaining": 1800.0},
+        "severity": "moderate",
+    }}]))
+
+    event = oracle.maybe_inject_event(
+        scenario=Scenario.model_validate(_scenario_payload()),
+        round_number=3,
+        current_protocol=None,
+        conversation_history=[],
+        inject_enabled=True,
+    )
+
+    assert isinstance(event, EnvironmentEvent)
+    assert event.event_type == "budget_cut"
+
+
+def test_oracle_generate_post_mortem_parses_json() -> None:
+    oracle = Oracle(_FakeClient([_post_mortem_payload()]))
+    from replicalab.oracle_models import AdjudicatorTerminalScore
+
+    post_mortem = oracle.generate_post_mortem(
+        scenario=Scenario.model_validate(_scenario_payload()),
+        final_protocol={"technique": "compact_cnn"},
+        conversation_history=[],
+        terminal_score=AdjudicatorTerminalScore(
+            rigor=0.9,
+            feasibility=0.8,
+            fidelity=0.85,
+            parsimony=0.9,
+            robustness=0.8,
+            power_preservation=0.8,
+            efficiency_bonus=0.2,
+            communication_bonus=0.1,
+            penalties={},
+            terminal_reward=5.0,
+            total_reward=5.6,
+        ),
+    )
+
+    assert isinstance(post_mortem, PostMortem)
+    assert "feasible compact CNN plan" in post_mortem.overall_summary
+
+
+def test_env_can_reset_from_oracle_scenario_without_changing_outer_contract() -> None:
+    class _FakeOracle:
+        def __init__(self) -> None:
+            self.scenario = Scenario.model_validate(_scenario_payload())
+
+        def generate_scenario(self, seed: int, difficulty: str, domain: str) -> Scenario:
+            assert seed == 11
+            assert difficulty == "medium"
+            assert domain == "ml_benchmark"
+            return self.scenario
+
+        def score_round(self, **_: object):
+            return AdjudicatorRoundScore.model_validate(_round_score_payload())
+
+        def maybe_inject_event(self, **_: object):
+            return None
+
+        def generate_post_mortem(self, **_: object):
+            return PostMortem.model_validate(_post_mortem_payload())
+
+    env = ReplicaLabEnv(
+        oracle=_FakeOracle(),
+        enable_oracle_post_mortem=True,
+    )
+    observation = env.reset(seed=11, scenario="ml_benchmark", difficulty="medium")
+
+    assert observation.scientist is not None
+    assert observation.scientist.paper_title == "Reproducing a Small Vision Benchmark"
+    assert observation.lab_manager is not None
+    assert "Validation server" in observation.lab_manager.equipment_available
tests/test_prompts.py CHANGED
@@ -3,6 +3,7 @@
 from __future__ import annotations
 
 from replicalab.prompts import (
+    load_prompt_asset,
     load_prompt_template,
     render_judge_prompt,
     render_lab_manager_prompt,
@@ -22,6 +23,18 @@ def test_load_prompt_template_reads_all_role_files() -> None:
         assert "ReplicaLab" in template
 
 
+def test_load_oracle_prompt_assets_reads_all_oracle_files() -> None:
+    for name in (
+        "oracle_world_architect",
+        "oracle_adjudicator",
+        "oracle_event_injector",
+        "oracle_post_mortem",
+        "oracle_lab_manager",
+    ):
+        template = load_prompt_asset(name)
+        assert len(template) > 100
+
+
 def test_render_scientist_prompt_injects_task_and_bounded_tools() -> None:
     prompt = render_scientist_prompt(_scenario("ml_benchmark"))
 
tests/test_scenarios.py CHANGED
@@ -7,7 +7,9 @@ from replicalab.scenarios import (
     NormalizedScenarioPack,
     available_scenario_families,
     generate_scenario,
+    oracle_scenario_to_normalized_pack,
 )
+from replicalab.oracle_models import Scenario as OracleScenario
 
 
 def test_generate_scenario_is_deterministic_for_same_seed() -> None:
@@ -140,3 +142,90 @@ def test_all_domains_produce_bookings_and_windows() -> None:
         pack = generate_scenario(seed=42, template=template, difficulty="medium")
         assert len(pack.resource_bookings) > 0, f"{template} has no bookings"
         assert len(pack.scheduling_windows) > 0, f"{template} has no windows"
+
+
+def test_oracle_scenario_adapter_preserves_domain_and_constraints() -> None:
+    oracle_scenario = OracleScenario.model_validate(
+        {
+            "paper": {
+                "title": "Adapting a benchmark under constraint",
+                "domain": "ml_benchmark",
+                "claim": "A small model remains competitive after budget cuts.",
+                "method_summary": "Train a compact benchmark baseline with fixed controls.",
+                "original_sample_size": 1200,
+                "original_duration_days": 3,
+                "original_technique": "compact_cnn",
+                "required_controls": ["baseline", "seed_control"],
+                "required_equipment": ["GPU cluster"],
+                "required_reagents": ["dataset snapshot"],
+                "statistical_test": "accuracy_gap",
+            },
+            "lab_constraints": {
+                "budget_total": 1800.0,
+                "budget_remaining": 1500.0,
+                "equipment": [
+                    {
+                        "name": "GPU cluster",
+                        "available": True,
+                        "condition": "shared_booking",
+                        "booking_conflicts": ["Monday"],
+                        "cost_per_use": 200.0,
+                    }
+                ],
+                "reagents": [
+                    {
+                        "name": "dataset snapshot",
+                        "in_stock": True,
+                        "quantity_available": 1.0,
+                        "unit": "copy",
+                        "lead_time_days": 0,
+                        "cost": 0.0,
+                    }
+                ],
+                "staff": [
+                    {
+                        "name": "Alex",
+                        "role": "engineer",
+                        "available_days": ["Monday", "Tuesday"],
+                        "skills": ["training", "evaluation"],
+                    }
+                ],
+                "max_duration_days": 5,
+                "safety_rules": ["No external internet during training."],
+                "valid_substitutions": [
+                    {
+                        "original": "GPU cluster",
+                        "substitute": "single high-memory GPU",
+                        "validity": "acceptable_with_caveats",
+                        "caveats": "Longer runtime is acceptable if evaluation fidelity is preserved.",
+                    }
+                ],
+            },
+            "minimum_viable_spec": {
+                "min_sample_size": 800,
+                "must_keep_controls": ["baseline", "seed_control"],
+                "acceptable_techniques": ["compact_cnn"],
+                "min_duration_days": 2,
+                "critical_equipment": ["GPU cluster"],
+                "flexible_equipment": [],
+                "critical_reagents": ["dataset snapshot"],
+                "flexible_reagents": [],
+                "power_threshold": 0.8,
+            },
+            "difficulty": "medium",
+            "narrative_hook": "The preferred GPU window has been partially reallocated.",
+        }
+    )
+
+    pack = oracle_scenario_to_normalized_pack(
+        seed=7,
+        template="ml_benchmark",
+        oracle_scenario=oracle_scenario,
+    )
+
+    assert pack.domain_id == "ml_benchmark"
+    assert pack.scientist_observation.paper_title == "Adapting a benchmark under constraint"
+    assert pack.lab_manager_observation.budget_total == 1800.0
+    assert "GPU cluster" in pack.lab_manager_observation.equipment_booked
+    assert pack.hidden_reference_spec.required_elements
+    assert pack.resource_bookings
tests/test_server.py CHANGED
@@ -509,7 +509,8 @@ class TestWebSocket:
 
         assert resp["type"] == "step_ok"
         assert resp["done"] is False
-        assert resp["reward"] == 0.0
+        assert resp["reward"] > 0.0
+        assert resp["info"]["step_reward_components"]["protocol_delta_bonus"] > 0.0
         assert resp["observation"] is not None
 
     def test_ws_full_episode_real_reward(self, client: TestClient) -> None:
@@ -627,8 +628,9 @@ class TestWebSocket:
 
         assert resp["done"] is True
        assert resp["info"]["verdict"] == "timeout"
-        assert resp["reward"] == 0.0
+        assert resp["reward"] < 0.0
         assert resp["info"]["reward_breakdown"] is not None
+        assert resp["info"]["reward_breakdown"]["penalties"]["timeout"] > 0.0
 
     def test_ws_terminal_episode_persists_real_replay_log(
         self, client: TestClient