Add hybrid Oracle layer and update architecture docs
- README.md +114 -74
- ReplicaLab_Architecture_v2.svg +3 -0
- ReplicaLab_Architecture_v2_polished.svg +3 -0
- docs/changes.md +2 -0
- docs/map/scoring.md +30 -0
- docs/map/tests.md +12 -5
- replicalab/__init__.py +3 -1
- replicalab/agents/__init__.py +2 -0
- replicalab/agents/judge_policy.py +8 -0
- replicalab/agents/lab_manager_agent.py +47 -0
- replicalab/cache.py +66 -0
- replicalab/config.py +26 -0
- replicalab/models.py +3 -0
- replicalab/oracle.py +263 -0
- replicalab/oracle_models.py +221 -0
- replicalab/prompts/__init__.py +10 -3
- replicalab/prompts/oracle_adjudicator.txt +26 -0
- replicalab/prompts/oracle_event_injector.txt +20 -0
- replicalab/prompts/oracle_lab_manager.txt +17 -0
- replicalab/prompts/oracle_post_mortem.txt +15 -0
- replicalab/prompts/oracle_world_architect.txt +23 -0
- replicalab/scenarios/__init__.py +2 -0
- replicalab/scenarios/templates.py +303 -0
- replicalab/scoring/explain.py +17 -17
- replicalab/scoring/rubric.py +45 -36
- replicalab/training/rollout.py +1 -1
- server/app.py +1 -1
- tests/test_cache.py +92 -0
- tests/test_env.py +8 -3
- tests/test_oracle.py +281 -0
- tests/test_prompts.py +13 -0
- tests/test_scenarios.py +89 -0
- tests/test_server.py +4 -2
README.md
CHANGED
@@ -1,6 +1,6 @@
---
title: ReplicaLab
emoji: "🧪"
colorFrom: blue
colorTo: green
sdk: docker

@@ -14,37 +14,53 @@ pinned: false

> *How do we adapt a plan without breaking the objective?*

ReplicaLab trains a Scientist policy to negotiate better plans under real constraints. The initial domain focus is mathematics and machine learning, with offline finance and trading design as the third scenario family. Physics and biology remain future adapters after the core normalized scenario layer is stable.

## Current Build Status

- The repository is past the foundation stage and has a working real environment plus deterministic judge pipeline.
- The Python package foundation is verified through editable install plus the full test suite.
- Shared contracts live in `replicalab/models.py`, with the signed-off freeze in `docs/fnd08_frozen_json_contract.md`.
- `server/app.py` serves the real `ReplicaLabEnv` by default, with the legacy stub retained only as a fallback path.
- `openenv.yaml` exists and passes local OpenEnv validation.
- Local Docker validation has been completed for the server image on port `7860`.
- Hugging Face Spaces deployment is live at `https://ayushozha-replicalab.hf.space` for the deterministic environment path.
- The frozen outer contract remains stable while the internal scenario engine uses a normalized scenario pack.
- The Lab Manager path is hybrid: deterministic feasibility truth with optional model-backed narrative responses.
- An additive Oracle hybrid layer now exists for optional frontier-model world generation, event injection, Lab Manager narration, and post-mortem analysis while deterministic scoring remains the canonical RL reward path.

## Team Ownership

| Owner | Current focus |
|------|----------------|
| Kian (Person A) | Shared schemas, validation, scenario engine, judge logic |
| Person B (Ayush) | Scientist prompting and parsing, notebook and client path |
| Max (Person C) | Server, deployment, and runtime plumbing |
| Kush (Person D) | Frontend, UI polish, docs, and demo assets |

---

## Architecture

<p align="center">
  <img src="./ReplicaLab_Architecture_v2.svg" alt="ReplicaLab Hybrid Architecture" width="100%"/>
</p>

ReplicaLab uses a **hybrid Oracle architecture**:

- The **Oracle layer** is optional and powers world-building and narrative intelligence:
  - richer scenario generation
  - optional event injection
  - optional LLM Lab Manager narration
  - optional post-mortem analysis
- The **deterministic core** remains canonical for RL:
  - environment transitions
  - validation
  - grounded Lab Manager feasibility
  - judge scoring and reward math

This satisfies the sponsor-facing “LLM as environment intelligence” direction without making reward noisy or irreproducible.

---

## How It Works
@@ -55,26 +71,31 @@

|------|------|----------------|
| **Scientist** | Trainable model policy | Proposes plans, asks questions, and preserves objective quality |
| **Lab Manager** | Hybrid model-backed policy with deterministic grounding | Negotiates revisions while the checker enforces feasibility and constraint truth |
| **Judge** | Deterministic rubric engine | Scores the final plan on rigor, feasibility, fidelity, and parsimony |
| **Oracle (optional)** | Frontier-model intelligence layer | Generates richer worlds, optional events, optional live LM narration, and post-mortem analysis |

### Episode Lifecycle

1. **Reset**: `reset(seed)` builds a normalized scenario pack and hidden reference spec.
2. **Scientist observes**: task summary, goal, history, and current plan.
3. **Lab Manager observes**: resource, scheduling, staffing, and policy constraints from the same normalized pack.
4. **Negotiation**: multiple rounds of proposals, counteroffers, and questions.
5. **Agreement or timeout**: both accept, or the round limit is reached.
6. **Reward**: the deterministic judge scores the final plan.
7. **Optional Oracle overlays**: event injection, round commentary, and post-mortem may be layered on top without replacing deterministic reward.
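The lifecycle above can be sketched as a single loop. Everything in this sketch is an illustrative stand-in (a tiny in-memory environment), not the real `ReplicaLabEnv` or `replicalab/client.py` API:

```python
# Illustrative version of the episode lifecycle, using a fake environment.
# FakeEnv and its reset/step shapes are hypothetical, not the real API.

class FakeEnv:
    def __init__(self, max_rounds: int = 3):
        self.max_rounds = max_rounds

    def reset(self, seed: int) -> dict:
        # 1. Reset: the seed deterministically selects the scenario pack.
        return {"round": 0, "agreed": False, "seed": seed}

    def step(self, obs: dict, action: str) -> dict:
        # 4-5. One negotiation round; agreement or the round limit ends it.
        obs = dict(obs, round=obs["round"] + 1)
        obs["agreed"] = action == "accept" or obs["round"] >= self.max_rounds
        return obs

def run_episode(env: FakeEnv, seed: int) -> dict:
    obs = env.reset(seed)
    while not obs["agreed"]:
        # 2-4. The Scientist observes, then proposes, revises, or accepts.
        action = "revise" if obs["round"] < 2 else "accept"
        obs = env.step(obs, action)
    return obs  # 6. A terminal judge would now score the agreed plan.

final = run_episode(FakeEnv(), seed=42)
```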

### Reward Formula

```text
total_reward = 10 * rigor * feasibility * fidelity * parsimony
             + efficiency_bonus
             + communication_bonus
             - penalties
```

The multiplicative core prevents fake wins: a theoretically strong but impossible plan scores low, and a cheap but invalid plan also scores low. Even when the Oracle layer is enabled, this deterministic path remains canonical for RL training and before/after evaluation.
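Spelled out as plain Python, the formula behaves like this (a sketch with illustrative argument names, not the actual `replicalab/scoring/rubric.py` signature):

```python
# Sketch of the reward formula above. Argument names are illustrative; the
# real implementation lives in replicalab/scoring/rubric.py.

def total_reward(rigor: float, feasibility: float, fidelity: float,
                 parsimony: float, efficiency_bonus: float = 0.0,
                 communication_bonus: float = 0.0,
                 penalties: float = 0.0) -> float:
    # Multiplicative core: any component at 0 zeroes the whole core.
    core = 10.0 * rigor * feasibility * fidelity * parsimony
    return core + efficiency_bonus + communication_bonus - penalties
```

A strong but infeasible plan (`feasibility=0`) collapses the core to zero, which is exactly the "no fake wins" property described above.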

### Internal Normalization Rule

The outer action and observation models stay stable. Domain-specific content is converted into a normalized scenario pack first, then mapped into the current `ScientistObservation` and `LabManagerObservation` contracts. Prompts are assembled from that normalized data rather than hard-coded per domain.

@@ -82,27 +103,24 @@

## Getting Started

This section mixes verified foundation commands with planned end-to-end commands.

### Prerequisites

- Python 3.10+
- Node.js 18+
- Docker
- A notebook runtime such as Google Colab or the H100-backed Jupyter environment

### Installation

```bash
git clone https://github.com/Ayush10/replicalab-ai.git
cd replicalab-ai

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -e ".[dev]"
```
### Running the Environment Server

```bash
python -m server.app
```

The server is intended to start at `http://localhost:7860`.

### Running the Frontend

```bash
npm install
npm run dev
```

### Running Tests

```bash
pytest tests/
```
@@ -141,30 +156,36 @@

## Training the Scientist

RL training improves the Scientist agent’s ability to negotiate effective, feasible plans.

### Selected Base Model

- **Primary Scientist model:** `Qwen3-4B`
- **Stretch fallback:** `Qwen3-8B`
- **Decision record:** `docs/agt11_scientist_model_selection.md`

### Planned Training Path

1. Connect the notebook to the environment via `replicalab/client.py`
2. Collect rollouts with `replicalab/training/rollout.py`
3. Train with **Unsloth or HF TRL**
4. Save:
   - reward curves
   - component curves
   - before/after evaluation metrics
   - replay and plot artifacts

### Training Loop

```text
reset -> Scientist acts -> Lab Manager responds -> ... -> episode ends -> deterministic reward -> policy update
```

### Target Behaviors Over Training

- Ask better questions before committing to a plan
- Preserve critical checks, assumptions, and required steps
- Choose realistic substitutions when preferred resources are unavailable
- Reach agreement in fewer rounds
- Avoid impossible or over-budget plans
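The rollout-collection step in the loop above can be sketched as follows; `collect_rollouts` and the episode shape here are toy stand-ins, not the real `replicalab/training/rollout.py` interface:

```python
# Sketch of rollout collection for the training loop. All names and shapes
# are illustrative assumptions, not the real rollout module's API.

def collect_rollouts(run_episode, seeds):
    """Run one episode per seed; keep the cumulative shaped episode reward."""
    batch = []
    for seed in seeds:
        step_rewards = run_episode(seed)  # per-step shaped rewards
        batch.append({"seed": seed, "episode_reward": sum(step_rewards)})
    return batch

def toy_episode(seed):
    # Stand-in episode: a few shaping signals plus one terminal judge reward.
    step_shaping = [0.1, -0.05, 0.2]
    terminal_reward = 5.0
    return step_shaping + [terminal_reward]

batch = collect_rollouts(toy_episode, seeds=[0, 1])
```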
@@ -172,7 +193,7 @@

## Scenario System

Scenarios are generated deterministically from a seed. Each template emits a normalized scenario pack with:

- `task_summary`
- `success_criteria`

@@ -181,7 +202,7 @@
- `allowed_substitutions`
- `hidden_reference_spec`

Difficulty scaling should mechanically tighten constraints, remove resources, or add conflicts instead of changing the outer contract or prompt structure.

| Difficulty | Description |
|------------|-------------|

@@ -192,7 +213,7 @@

### Included Scenario Templates

| Template | Domain | Example Task |
|----------|--------|--------------|
| `math_reasoning` | Mathematics | Proof planning under tool, review, and time constraints |
| `ml_benchmark` | Machine learning | Model evaluation with dataset, compute, and time constraints |
| `finance_trading` | Finance and trading | Offline strategy and backtest planning under risk and capital limits |
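Seed-to-pack determinism can be sketched with Python's seeded `random.Random`; the pack fields follow the list above, while the generation logic itself is a hypothetical stand-in for the templates module:

```python
import random

# Sketch of seed-deterministic scenario generation. Field names mirror the
# normalized scenario pack above; the logic is an illustrative stand-in.

def build_scenario_pack(seed: int) -> dict:
    rng = random.Random(seed)  # same seed -> same scenario, every time
    budget = rng.choice([5_000, 10_000, 20_000])
    return {
        "task_summary": f"Evaluate a model under a ${budget} compute budget",
        "success_criteria": ["meets accuracy target", "stays within budget"],
        "allowed_substitutions": {"gpu_a": "gpu_b"},
        "hidden_reference_spec": {"budget": budget},
    }

# Determinism check: identical seeds produce identical packs.
pack_a = build_scenario_pack(7)
pack_b = build_scenario_pack(7)
```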
@@ -201,57 +222,68 @@

## Project Structure

```text
replicalab-ai/
├── README.md
├── ReplicaLab_Architecture_v2.svg
├── pyproject.toml
├── openenv.yaml
├── replicalab/
│   ├── __init__.py
│   ├── models.py            # Action, Observation, State schemas
│   ├── client.py            # OpenEnv client wrapper
│   ├── oracle.py            # Optional frontier-model Oracle wrapper
│   ├── oracle_models.py     # Oracle scenario and post-mortem schemas
│   ├── cache.py             # Cached Oracle scenario generation
│   ├── prompts/
│   │   ├── scientist.txt
│   │   ├── lab_manager.txt
│   │   ├── judge.txt
│   │   ├── oracle_world_architect.txt
│   │   ├── oracle_adjudicator.txt
│   │   ├── oracle_event_injector.txt
│   │   ├── oracle_post_mortem.txt
│   │   └── oracle_lab_manager.txt
│   ├── scenarios/
│   │   ├── templates.py     # Normalized scenario pack + Oracle adapter
│   │   ├── math_reasoning.py
│   │   ├── ml_benchmark.py
│   │   └── finance_trading.py
│   ├── scoring/
│   │   ├── rubric.py        # Canonical deterministic reward math
│   │   ├── rigor.py
│   │   ├── feasibility.py
│   │   ├── fidelity.py
│   │   └── explain.py
│   ├── agents/
│   │   ├── scientist_policy.py
│   │   ├── lab_manager_policy.py
│   │   ├── lab_manager_agent.py  # Optional LLM Lab Manager wrapper
│   │   └── judge_policy.py
│   ├── env/
│   │   └── replicalab_env.py     # Real env with optional Oracle hooks
│   ├── training/
│   │   └── rollout.py
│   └── utils/
│       ├── seed.py
│       ├── validation.py
│       └── logging.py
├── server/
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── package.json
│   ├── vite.config.ts
│   └── src/
├── notebooks/
│   └── train_colab.ipynb
└── tests/
    ├── test_env.py
    ├── test_reward.py
    ├── test_scenarios.py
    ├── test_oracle.py
    ├── test_cache.py
    └── test_server.py
```
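`replicalab/cache.py` is described above as cached Oracle scenario generation; a minimal seed-keyed cache could look like this (names and behavior are assumptions, not the real module):

```python
# Minimal sketch of seed-keyed caching for Oracle scenario generation, in the
# spirit of replicalab/cache.py. The interface here is a hypothetical stand-in.

class ScenarioCache:
    def __init__(self, generate):
        self._generate = generate  # expensive call, e.g. a frontier model
        self._store = {}
        self.misses = 0

    def get(self, seed: int):
        # Only call the generator once per seed; repeated resets replay the
        # stored result, keeping episodes cheap and reproducible.
        if seed not in self._store:
            self.misses += 1
            self._store[seed] = self._generate(seed)
        return self._store[seed]

cache = ScenarioCache(lambda seed: {"seed": seed, "world": f"scenario-{seed}"})
first = cache.get(3)
second = cache.get(3)  # served from cache, no second generator call
```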
@@ -268,16 +300,24 @@ docker run -p 7860:7860 replicalab

### Hugging Face Spaces

**Live deployment:** `https://ayushozha-replicalab.hf.space`

The app is deployed on HF Spaces with `sdk: docker` on port `7860`.

```bash
curl https://ayushozha-replicalab.hf.space/health
# -> {"status":"ok","env":"real"}
```

Current Space deployment is complete for the deterministic environment path. If live Oracle mode is enabled later, the Space will additionally need:

- provider SDK dependencies
- model API-key secrets
- runtime feature flags
- cold-start and latency handling

The deterministic deployment itself does not need to be redesigned.

---

## Toolchain

@@ -291,7 +331,7 @@
| **Tailwind + shadcn/ui** | Styling |
| **Docker** | Packaging |
| **Hugging Face Spaces** | Public hosting |
| **Notebook / Colab / H100** | Training and evaluation |

---
ReplicaLab_Architecture_v2.svg
ADDED (Git LFS)

ReplicaLab_Architecture_v2_polished.svg
ADDED (Git LFS)
docs/changes.md
CHANGED

@@ -58,4 +58,6 @@ Rules:
| 2026-03-08 | Person B (Ayush) | TRN 04 | Implemented the rollout collection loop as a reusable Python module rather than only inside a notebook | The backlog labels `TRN 04` as notebook work, but implementing it in `replicalab/training/rollout.py` makes the same rollout logic reusable across notebooks, tests, and future trainer code while preserving the required behavior | Extended `RolloutWorker` with terminal `StepInfo`, bounded tool trace aggregation, and `collect_rollouts(...)`; added trace and batch tests in `tests/test_rollout_traces.py` and kept the rollout logic fully testable outside a notebook | `TRN 05` is now unblocked and notebooks can import the rollout loop instead of reimplementing it |
| 2026-03-08 | Person B (Ayush) | API 14 | Completed the REST session isolation verification even though the task was assigned to Person C | The session isolation logic already worked correctly in `server/app.py`; the task was still marked partial because no dedicated tests proved concurrent-user isolation against the real env | Created `tests/test_api_rest_isolation.py` with 11 tests covering session independence, round-count isolation, terminal isolation, session_id reuse, invalid session handling, and replay isolation; no server changes needed; 307 tests pass | No new dependencies unblocked; `API 14` was the last partial API task besides `API 01` and `OBS 02` |
| 2026-03-08 | Person B (Ayush) | MOD 07 and MOD 10 | Closed the replay-persistence and schema-example tasks on Max's lane after verifying the code that had already landed | `replicalab/utils/logging.py` and the API example generator were implemented and passing tests, but the source-of-truth backlog and Max's owner docs still showed both tasks as not started, and the generated examples still contained stale stub audit text | Updated `tests/fixtures/generate_api_examples.py` to derive terminal judge metadata from the current deterministic judge helpers, regenerated `api_schema_examples.json`, and synced `MOD 07`/`MOD 10` to complete in the comprehensive backlog, completion rollup, and Max owner docs | `MOD 08` and `JDG 07` are now clearly unblocked in the tracked plan |
| 2026-03-08 | Person B (Ayush) | Reward shaping and rubric refinement | Expanded the reward system beyond terminal-only scoring without reopening the outer action or observation contract | Sparse terminal-only reward was too weak for RL training, and the project needed deterministic shaping rather than a frontier-model reward source | Added a parsimony term to terminal reward, introduced deterministic step shaping in `ReplicaLabEnv` (information gain, protocol delta, momentum, contradiction, hallucination, stalling, regression, invalid-action, timeout, and no-agreement signals), updated rollout aggregation to use cumulative episode reward, and aligned env/server tests to the new shaped-reward semantics while keeping the full suite green at 356 tests | Keep the notebook and training plots explicit about terminal reward components vs cumulative shaped episode reward |
| 2026-03-08 | Person B (Ayush) | Oracle hybrid architecture | Added an Oracle-style frontier-model layer as an additive integration instead of replacing the deterministic environment and reward stack | The sponsor-facing V2 direction calls for an LLM woven through scenario generation, environment interaction, and explanation, but the RL training path still needs deterministic reward and reproducible evaluation | Added `oracle_models.py`, `oracle.py`, `cache.py`, Oracle prompt assets, an optional LLM Lab Manager wrapper, an adapter from Oracle scenarios into the existing normalized scenario pack, and feature-flagged Oracle hooks in `ReplicaLabEnv`; kept deterministic scoring in `replicalab/scoring/*` as the canonical training reward; expanded test coverage with `test_oracle.py`, `test_cache.py`, and Oracle adapter/prompt tests; full suite now passes at 365 tests | If this grows beyond the current additive mode, record any future contract or reward-source changes separately before altering the deterministic training path |
docs/map/scoring.md
CHANGED

@@ -6,6 +6,18 @@
> **Tasks implemented:** JDG 01, JDG 02, JDG 03, JDG 04, JDG 05, JDG 06, JDG 08
> **Tasks remaining:** JDG 07

## Oracle Hybrid Note

The repo now includes an additive Oracle layer for richer scenario generation, optional Lab Manager narration, optional event injection, and post-mortem analysis. None of that replaces the files in `replicalab/scoring/`.

For RL training, this folder remains the canonical reward source:

- deterministic
- reproducible
- testable
- used by the environment for the actual scalar reward signal

## Architecture

```
replicalab/scoring/
    ...
    explain.py  # JDG 06 — deterministic plain-English explanation
```

## Current Reward Structure

The training signal now has two layers:

- **Terminal reward** from `replicalab/scoring/rubric.py`
  - `10 * rigor * feasibility * fidelity * parsimony`
  - plus bonuses
  - minus named penalties
- **Step shaping reward** from `replicalab/env/replicalab_env.py`
  - information-gain bonus for novel questions
  - protocol-delta and momentum bonuses for productive revisions
  - contradiction, hallucination, stalling, regression, invalid-action, timeout, and no-agreement penalties

The judge remains deterministic. The terminal audit still explains the final `RewardBreakdown`, while cumulative episode reward now includes the per-step shaping applied inside the environment.

## Shared Utilities

Token matching extracted into `replicalab/utils/text.py`:
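Under the two-layer structure above, cumulative episode reward is a plain sum of step shaping plus the terminal rubric score. A minimal sketch (illustrative names, not the exact replicalab APIs):

```python
# Sketch of the two-layer reward: cumulative episode reward is the sum of
# per-step shaping plus the terminal rubric reward. Names are illustrative.

def terminal_reward(rigor, feasibility, fidelity, parsimony,
                    bonuses=0.0, penalties=0.0):
    # Deterministic multiplicative core from the rubric, plus named terms.
    return 10.0 * rigor * feasibility * fidelity * parsimony + bonuses - penalties

def episode_reward(step_shaping, terminal):
    # The judge stays deterministic; shaping only adds dense per-step signal.
    return sum(step_shaping) + terminal

term = terminal_reward(0.9, 0.8, 1.0, 1.0)
total = episode_reward([0.05, -0.1, 0.2], term)
```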
docs/map/tests.md
CHANGED
# Tests Map - `tests/`

> 365 tests across 18 files. All passing.
>
> **Last verified:** 2026-03-08

| File | Tests | What it covers |
|------|-------|----------------|
| `test_api_rest_isolation.py` | 11 | `API 14` REST session isolation and replay separation |
| `test_cache.py` | 2 | Oracle scenario caching and reuse |
| `test_client.py` | 24 | `TRN 13` reusable client over REST and WebSocket |
| `test_config.py` | 3 | Shared constants and config consistency |
| `test_env.py` | 56 | `ENV 01-08`, `ENV 10`, `ENV 11`, `OBS 04`, `JDG 04-05`, `TST 01-03` |
| `test_judge_policy.py` | 10 | `JDG 11` structured judge audit payload |
| `test_lab_manager_policy.py` | 37 | `AGT 05-07` plus `AGT 09` determinism coverage |
| `test_models.py` | 21 | Action, observation, step, state, and log contracts |
| `test_logging.py` | 11 | `MOD 07` replay persistence and `JDG 07` CSV logging helpers |
| `test_oracle.py` | 5 | Oracle hybrid wrapper, structured parsing, and env reset adapter |
| `test_prompts.py` | 7 | `AGT 10` prompt files and Oracle prompt asset loading |
| `test_reward.py` | 40 | `JDG 01-06`, `JDG 08`, and reward regression coverage |
| `test_rollout.py` | 12 | `TRN 03` rollout worker behavior |
| `test_rollout_traces.py` | 2 | `TRN 04` bounded tool trace aggregation and batched collection |
| `test_scenarios.py` | 14 | `SCN 01-13` scenario generation, determinism, and Oracle scenario adaptation |
| `test_scientist_policy.py` | 46 | `MOD 09`, `AGT 01-04`, `AGT 08` |
| `test_server.py` | 44 | `API 01-04`, `API 06-08`, `API 13-14`, replay audit propagation, and root landing page |
| `test_validation.py` | 20 | `MOD 05-06` semantic validation |
| **Total** | **365** | |

## Coverage Notes

- `test_scientist_policy.py`, `test_prompts.py`, `test_rollout.py`, and `test_rollout_traces.py` together cover prompt construction, observation formatting, parse/retry, baseline policy, rollout collection, and bounded tool trace capture.
- The judge stack is covered end to end:
  - `test_reward.py` covers rubric scores and reward math, while `test_judge_policy.py` covers structured audit payload generation.
- The Oracle hybrid layer is covered additively:
  - `test_oracle.py`, `test_cache.py`, and `test_prompts.py` cover Oracle scenario generation wrappers, cache reuse, and prompt asset loading without changing the deterministic reward contract.

## Remaining Gaps

|------|--------------------|
| Models and contracts | `test_models.py`, `test_validation.py` |
| Scenarios | `test_scenarios.py` |
| Oracle integration and cache | `test_oracle.py`, `test_cache.py`, `test_prompts.py` |
| Scientist policy | `test_scientist_policy.py`, `test_prompts.py` |
| Lab Manager policy | `test_lab_manager_policy.py` |
| Judge and reward | `test_reward.py`, `test_judge_policy.py` |
replicalab/__init__.py
CHANGED
```diff
@@ -1,3 +1,5 @@
+from replicalab.cache import CachedOracle, ScenarioCache
 from replicalab.client import ReplicaLabClient
+from replicalab.oracle import Oracle
 
-__all__ = ["ReplicaLabClient"]
+__all__ = ["CachedOracle", "Oracle", "ReplicaLabClient", "ScenarioCache"]
```
replicalab/agents/__init__.py
CHANGED
```diff
@@ -4,6 +4,7 @@ from .judge_policy import (
     JudgeAudit,
     build_judge_audit,
 )
+from .lab_manager_agent import LabManagerAgent
 from .lab_manager_policy import (
     AlternativeSuggestion,
     FeasibilityCheckResult,
@@ -27,6 +28,7 @@ __all__ = [
     "AlternativeSuggestion",
     "FeasibilityCheckResult",
     "JudgeAudit",
+    "LabManagerAgent",
     "RetryMetadata",
     "ScientistCallResult",
     "ScientistOutputParseError",
```
replicalab/agents/judge_policy.py
CHANGED
```diff
@@ -109,6 +109,7 @@ def _derive_failure_reasons(
         (breakdown.feasibility, "feasibility", "Feasibility remained too low under the scenario constraints."),
         (breakdown.fidelity, "fidelity", "The final plan diverged too far from the hidden reference requirements."),
         (breakdown.rigor, "rigor", "The plan missed required checks or justification quality targets."),
+        (breakdown.parsimony, "parsimony", "The final plan requested more resources or controls than the scenario complexity justified."),
     ]
     for score, _name, message in components:
         if score < _WEAK_THRESHOLD:
@@ -119,6 +120,13 @@ def _derive_failure_reasons(
     _PENALTY_LABELS: dict[str, str] = {
         "invalid_tool_use": "A bounded-tool usage violation was detected.",
         "unsupported_claim": "An unsupported evidence claim was penalized.",
+        "timeout": "A timeout penalty was applied at the round limit.",
+        "no_agreement": "A no-agreement penalty was applied.",
+        "invalid_action": "An invalid action penalty was applied after a failed protocol proposal.",
+        "hallucination": "A hallucination penalty was applied for unsupported inventory references.",
+        "contradiction": "A contradiction penalty was applied for repeating blocked requirements.",
+        "stalling": "A stalling penalty was applied for repeating an unproductive move.",
+        "regression": "A regression penalty was applied because the revision worsened the protocol.",
     }
     for key, amount in sorted(breakdown.penalties.items()):
         if amount > 0:
```
replicalab/agents/lab_manager_agent.py
ADDED
```python
"""Optional LLM-backed Lab Manager narration layer."""

from __future__ import annotations

import json
from typing import Any

from replicalab.oracle import call_json_model
from replicalab.oracle_models import LabManagerResponse, OracleLabManagerObservation
from replicalab.prompts import load_prompt_asset


class LabManagerAgent:
    """LLM-based Lab Manager driven by Oracle-generated constraints.

    This is additive to the deterministic feasibility checker. The current
    env can use this agent to narrate or enrich responses while keeping
    canonical feasibility and reward logic deterministic.
    """

    def __init__(self, client: Any, model: str = "frontier-oracle") -> None:
        self.client = client
        self.model = model

    def respond(self, observation: OracleLabManagerObservation) -> LabManagerResponse:
        system = load_prompt_asset("oracle_lab_manager")
        user = (
            "A Scientist has taken an action. Respond as the Lab Manager.\n\n"
            "YOUR LAB CONSTRAINTS (ground truth, do not deviate):\n"
            f"{observation.lab_constraints.model_dump_json(indent=2)}\n\n"
            "CURRENT PROTOCOL ON THE TABLE:\n"
            f"{json.dumps(observation.current_protocol, indent=2) if observation.current_protocol else 'None yet'}\n\n"
            f"SCIENTIST'S ACTION (round {observation.round_number}):\n"
            f"{observation.scientist_action.model_dump_json(indent=2)}\n\n"
            "Respond ONLY with valid JSON matching LabManagerResponse.\n"
            "No markdown. No preamble."
        )
        return call_json_model(
            self.client,
            model=self.model,
            system=system,
            user=user,
            response_model=LabManagerResponse,
        )


__all__ = ["LabManagerAgent"]
```
replicalab/cache.py
ADDED
```python
"""Scenario caching for Oracle-generated environments."""

from __future__ import annotations

import hashlib
import json
from pathlib import Path
from typing import Optional

from replicalab.config import ORACLE_SCENARIO_CACHE_DIR
from replicalab.oracle import Oracle
from replicalab.oracle_models import Scenario


class ScenarioCache:
    """Cache Oracle-generated scenarios by seed, difficulty, and domain."""

    def __init__(self, cache_dir: str | Path = ORACLE_SCENARIO_CACHE_DIR) -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _key(self, seed: int, difficulty: str, domain: str) -> str:
        raw = f"{seed}:{difficulty}:{domain}"
        return hashlib.md5(raw.encode("utf-8")).hexdigest()

    def _path(self, seed: int, difficulty: str, domain: str) -> Path:
        return self.cache_dir / f"{self._key(seed, difficulty, domain)}.json"

    def get(self, seed: int, difficulty: str, domain: str) -> Optional[Scenario]:
        path = self._path(seed, difficulty, domain)
        if not path.exists():
            return None
        return Scenario.model_validate(json.loads(path.read_text(encoding="utf-8")))

    def put(self, seed: int, difficulty: str, domain: str, scenario: Scenario) -> Path:
        path = self._path(seed, difficulty, domain)
        path.write_text(scenario.model_dump_json(indent=2), encoding="utf-8")
        return path


class CachedOracle(Oracle):
    """Oracle wrapper that caches scenario generation by seed."""

    def __init__(
        self,
        client: object,
        model: str = "frontier-oracle",
        *,
        cache: ScenarioCache | None = None,
    ) -> None:
        super().__init__(client=client, model=model)
        self.cache = cache or ScenarioCache()

    def generate_scenario(self, seed: int, difficulty: str, domain: str) -> Scenario:
        cached = self.cache.get(seed, difficulty, domain)
        if cached is not None:
            return cached
        scenario = super().generate_scenario(seed=seed, difficulty=difficulty, domain=domain)
        self.cache.put(seed, difficulty, domain, scenario)
        return scenario


__all__ = [
    "CachedOracle",
    "ScenarioCache",
]
```
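`ScenarioCache` is just keyed JSON files on disk. A dependency-free sketch of the same keying and roundtrip, using a plain dict in place of the `Scenario` model:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cache_key(seed: int, difficulty: str, domain: str) -> str:
    # Same keying scheme as ScenarioCache: md5 over "seed:difficulty:domain".
    return hashlib.md5(f"{seed}:{difficulty}:{domain}".encode("utf-8")).hexdigest()

cache_dir = Path(tempfile.mkdtemp())
path = cache_dir / f"{cache_key(7, 'hard', 'ml_benchmark')}.json"

scenario = {"seed": 7, "difficulty": "hard", "domain": "ml_benchmark"}
path.write_text(json.dumps(scenario), encoding="utf-8")    # put
restored = json.loads(path.read_text(encoding="utf-8"))    # get
```

Because the key is a pure function of `(seed, difficulty, domain)`, repeated resets with the same triple hit the same file and never re-invoke the Oracle.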
replicalab/config.py
CHANGED
```diff
@@ -29,3 +29,29 @@ API_PORT = 7860
 
 LOG_LEVEL = os.environ.get("REPLICALAB_LOG_LEVEL", "INFO").upper()
 LOG_FORMAT = "%(asctime)s [%(levelname)s] %(name)s: %(message)s"
+
+ORACLE_ENABLED = os.environ.get("REPLICALAB_ORACLE_ENABLED", "0") == "1"
+ORACLE_EVENTS_ENABLED = os.environ.get("REPLICALAB_ORACLE_EVENTS_ENABLED", "0") == "1"
+ORACLE_POST_MORTEM_ENABLED = (
+    os.environ.get("REPLICALAB_ORACLE_POST_MORTEM_ENABLED", "0") == "1"
+)
+ORACLE_MODEL = os.environ.get("REPLICALAB_ORACLE_MODEL", "frontier-oracle")
+ORACLE_SCENARIO_CACHE_DIR = os.environ.get(
+    "REPLICALAB_ORACLE_SCENARIO_CACHE_DIR",
+    ".scenario_cache",
+)
+
+# Deterministic reward shaping constants.
+STEP_PROTOCOL_DELTA_SCALE = 0.25
+STEP_PROTOCOL_DELTA_CAP = 0.3
+STEP_INFO_GAIN_BONUS = 0.05
+STEP_INFO_GAIN_CAP = 0.15
+STEP_MOMENTUM_BONUS = 0.05
+STEP_STALLING_PENALTY = 0.05
+STEP_REPEATED_QUESTION_PENALTY = 0.03
+STEP_REGRESSION_PENALTY = 0.1
+STEP_CONTRADICTION_PENALTY = 0.05
+STEP_INVALID_ACTION_PENALTY = 0.1
+STEP_HALLUCINATION_PENALTY = 0.05
+TERMINAL_TIMEOUT_PENALTY = 0.2
+TERMINAL_NO_AGREEMENT_PENALTY = 0.1
```
replicalab/models.py
CHANGED
```diff
@@ -318,6 +318,9 @@ class RewardBreakdown(BaseModel):
     rigor: float = Field(default=0.0, ge=0, le=1)
     feasibility: float = Field(default=0.0, ge=0, le=1)
     fidelity: float = Field(default=0.0, ge=0, le=1)
+    # Defaults to 1.0 so existing exact-value tests and manual breakdowns
+    # preserve the prior reward semantics unless parsimony is computed.
+    parsimony: float = Field(default=1.0, ge=0, le=1)
     efficiency_bonus: float = 0.0
     communication_bonus: float = 0.0
     penalties: dict[str, float] = Field(default_factory=dict)
```
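The effect of the neutral default can be shown with a simplified stand-in (a plain dataclass here, not the pydantic model):

```python
from dataclasses import dataclass

@dataclass
class Breakdown:
    # Simplified stand-in for RewardBreakdown; only the multiplicative fields.
    rigor: float = 0.0
    feasibility: float = 0.0
    fidelity: float = 0.0
    parsimony: float = 1.0  # neutral default: multiplying by 1.0 is a no-op

def core(b: Breakdown) -> float:
    return 10 * b.rigor * b.feasibility * b.fidelity * b.parsimony

# Breakdowns built before parsimony existed keep their exact old value.
legacy = Breakdown(rigor=0.5, feasibility=0.5, fidelity=1.0)
```

Since `parsimony` only enters the reward as a factor, a default of `1.0` leaves every pre-existing breakdown bit-for-bit unchanged.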
replicalab/oracle.py
ADDED
```python
"""Optional frontier-model Oracle wrapper for ReplicaLab.

The Oracle is an additive intelligence layer. It can generate richer
scenarios, optional round commentary, optional events, and post-mortem
analyses, while the existing deterministic reward pipeline remains
canonical for RL training.
"""

from __future__ import annotations

import json
from typing import Any, Optional, TypeVar

from pydantic import BaseModel

from replicalab.oracle_models import (
    AdjudicatorRoundScore,
    AdjudicatorTerminalScore,
    EnvironmentEvent,
    LabManagerResponse,
    PostMortem,
    Scenario,
)
from replicalab.prompts import load_prompt_asset

T = TypeVar("T", bound=BaseModel)


def _strip_markdown_fences(text: str) -> str:
    cleaned = text.strip()
    if cleaned.startswith("```"):
        lines = cleaned.splitlines()
        if lines:
            lines = lines[1:]
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        cleaned = "\n".join(lines).strip()
    return cleaned


def _extract_response_text(response: Any) -> str:
    if isinstance(response, str):
        return response

    output_text = getattr(response, "output_text", None)
    if output_text:
        return output_text

    content = getattr(response, "content", None)
    if content:
        chunks: list[str] = []
        for item in content:
            text = getattr(item, "text", None)
            if text:
                chunks.append(text)
        if chunks:
            return "\n".join(chunks)

    output = getattr(response, "output", None)
    if output:
        parts: list[str] = []
        for item in output:
            inner = getattr(item, "content", None)
            if not inner:
                continue
            for piece in inner:
                text = getattr(piece, "text", None)
                if text:
                    parts.append(text)
        if parts:
            return "\n".join(parts)

    raise ValueError("Could not extract text from Oracle client response")


def _invoke_client(client: Any, *, model: str, system: str, user: str) -> str:
    if hasattr(client, "messages") and hasattr(client.messages, "create"):
        response = client.messages.create(
            model=model,
            max_tokens=4096,
            system=system,
            messages=[{"role": "user", "content": user}],
        )
        return _extract_response_text(response)

    if hasattr(client, "responses") and hasattr(client.responses, "create"):
        response = client.responses.create(
            model=model,
            instructions=system,
            input=user,
        )
        return _extract_response_text(response)

    if callable(client):
        try:
            response = client(system=system, user=user, model=model)
        except TypeError:
            response = client(system, user)
        return _extract_response_text(response)

    raise TypeError("Unsupported Oracle client: expected Anthropic/OpenAI-style client or callable")


def call_json_model(
    client: Any,
    *,
    model: str,
    system: str,
    user: str,
    response_model: type[T],
) -> T:
    raw = _invoke_client(client, model=model, system=system, user=user)
    cleaned = _strip_markdown_fences(raw)
    data = json.loads(cleaned)
    return response_model.model_validate(data)


class Oracle:
    """Single frontier model operating in multiple roles/personas."""

    def __init__(self, client: Any, model: str = "frontier-oracle") -> None:
        self.client = client
        self.model = model

    def generate_scenario(self, seed: int, difficulty: str, domain: str) -> Scenario:
        system = load_prompt_asset("oracle_world_architect")
        user = (
            "Generate a complete replication scenario.\n\n"
            f"Seed: {seed}\n"
            f"Difficulty: {difficulty}\n"
            f"Domain: {domain}\n\n"
            "Respond with a single JSON object matching the Scenario schema.\n"
            "No markdown, no explanation, only valid JSON."
        )
        return call_json_model(
            self.client,
            model=self.model,
            system=system,
            user=user,
            response_model=Scenario,
        )

    def score_round(
        self,
        *,
        scenario: Scenario,
        round_number: int,
        scientist_action: BaseModel,
        lab_manager_response: LabManagerResponse,
        conversation_history: list[dict],
        current_protocol: Optional[dict],
        previous_scores: list[AdjudicatorRoundScore],
    ) -> AdjudicatorRoundScore:
        system = load_prompt_asset("oracle_adjudicator")
        user = (
            "Score this negotiation round.\n\n"
            f"SCENARIO:\n{scenario.model_dump_json(indent=2)}\n\n"
            f"ROUND: {round_number}\n"
            f"SCIENTIST ACTION: {scientist_action.model_dump_json(indent=2)}\n"
            f"LAB MANAGER RESPONSE: {lab_manager_response.model_dump_json(indent=2)}\n"
            f"CURRENT PROTOCOL: {json.dumps(current_protocol, indent=2)}\n"
            f"PREVIOUS SCORES: {json.dumps([score.model_dump() for score in previous_scores], indent=2)}\n\n"
            "Respond with a single JSON object matching AdjudicatorRoundScore.\n"
            "No markdown, no explanation, only valid JSON."
        )
        return call_json_model(
            self.client,
            model=self.model,
            system=system,
            user=user,
            response_model=AdjudicatorRoundScore,
        )

    def score_terminal(
        self,
        *,
        scenario: Scenario,
        final_protocol: dict,
        conversation_history: list[dict],
        round_scores: list[AdjudicatorRoundScore],
    ) -> AdjudicatorTerminalScore:
        system = load_prompt_asset("oracle_adjudicator")
        user = (
            "Compute the terminal score for this completed episode.\n\n"
            f"SCENARIO:\n{scenario.model_dump_json(indent=2)}\n\n"
            f"FINAL PROTOCOL: {json.dumps(final_protocol, indent=2)}\n"
            f"CONVERSATION HISTORY: {json.dumps(conversation_history, indent=2)}\n"
            f"ROUND SCORES: {json.dumps([score.model_dump() for score in round_scores], indent=2)}\n"
            f"SUM OF STEP REWARDS: {sum(score.step_reward for score in round_scores)}\n\n"
            "Respond with a single JSON object matching AdjudicatorTerminalScore.\n"
            "No markdown, no explanation, only valid JSON."
        )
        return call_json_model(
            self.client,
            model=self.model,
            system=system,
            user=user,
            response_model=AdjudicatorTerminalScore,
        )

    def maybe_inject_event(
        self,
        *,
        scenario: Scenario,
        round_number: int,
        current_protocol: Optional[dict],
        conversation_history: list[dict],
        inject_enabled: bool = False,
    ) -> Optional[EnvironmentEvent]:
        if not inject_enabled:
            return None

        system = load_prompt_asset("oracle_event_injector")
        user = (
            "Decide whether to inject an event this round.\n\n"
            f"SCENARIO:\n{scenario.model_dump_json(indent=2)}\n\n"
            f"ROUND: {round_number}\n"
            f"CURRENT PROTOCOL: {json.dumps(current_protocol, indent=2)}\n"
            f"CONVERSATION SO FAR: {json.dumps(conversation_history, indent=2)}\n\n"
            'If no event is needed, respond with: {"inject": false}\n'
            'If injecting, respond with: {"inject": true, "event": <EnvironmentEvent JSON>}\n'
            "No markdown, no explanation, only valid JSON."
        )
        raw = _invoke_client(self.client, model=self.model, system=system, user=user)
        cleaned = _strip_markdown_fences(raw)
        data = json.loads(cleaned)
        if not data.get("inject", False):
            return None
        return EnvironmentEvent.model_validate(data["event"])

    def generate_post_mortem(
        self,
        *,
        scenario: Scenario,
        final_protocol: dict,
        conversation_history: list[dict],
        terminal_score: AdjudicatorTerminalScore,
    ) -> PostMortem:
        system = load_prompt_asset("oracle_post_mortem")
        user = (
            "Generate a post-mortem analysis of this episode.\n\n"
            f"PAPER: {scenario.paper.model_dump_json(indent=2)}\n"
            f"LAB CONSTRAINTS: {scenario.lab_constraints.model_dump_json(indent=2)}\n"
            f"HIDDEN SPEC: {scenario.minimum_viable_spec.model_dump_json(indent=2)}\n"
            f"FINAL PROTOCOL: {json.dumps(final_protocol, indent=2)}\n"
            f"CONVERSATION: {json.dumps(conversation_history, indent=2)}\n"
            f"TERMINAL SCORE: {terminal_score.model_dump_json(indent=2)}\n\n"
            "Respond with a single JSON object matching PostMortem.\n"
            "No markdown, no explanation, only valid JSON."
        )
        return call_json_model(
            self.client,
            model=self.model,
            system=system,
            user=user,
            response_model=PostMortem,
        )


__all__ = [
    "Oracle",
    "call_json_model",
]
```
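`_invoke_client` is duck-typed: an Anthropic-style client, an OpenAI-style client, or any plain callable works, provided the reply reduces to JSON text. A minimal sketch with a callable stub (no real API call; `strip_fences` mirrors `_strip_markdown_fences`):

```python
import json

def strip_fences(text: str) -> str:
    # Tolerate models that wrap their JSON in ``` fences despite instructions.
    cleaned = text.strip()
    if cleaned.startswith("```"):
        lines = cleaned.splitlines()[1:]
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        cleaned = "\n".join(lines).strip()
    return cleaned

def stub_client(system: str, user: str, model: str = "stub") -> str:
    # Stand-in for a frontier model: echoes a fenced JSON payload.
    return '```json\n{"inject": false}\n```'

raw = stub_client(system="oracle_event_injector", user="round 3")
data = json.loads(strip_fences(raw))
```

This stub shape is also handy in tests: `Oracle(client=stub_client)` exercises the full prompt/parse path without any network dependency.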
replicalab/oracle_models.py
ADDED
|
@@ -0,0 +1,221 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
```python
"""Typed models for the optional Oracle-driven environment layer.

These models are additive to the existing ReplicaLab contracts. The
deterministic env, reward, and API surface remain canonical; Oracle models
power richer scenario generation, optional live Lab Manager responses,
optional event injection, and post-mortem analysis.
"""

from __future__ import annotations

from enum import Enum
from typing import Literal, Optional

from pydantic import BaseModel, ConfigDict, Field

from replicalab.models import ScientistAction


class Difficulty(str, Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


class Equipment(BaseModel):
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    name: str
    available: bool
    condition: str
    booking_conflicts: list[str] = Field(default_factory=list)
    cost_per_use: float = 0.0


class Reagent(BaseModel):
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    name: str
    in_stock: bool
    quantity_available: float = 0.0
    unit: str = "mL"
    lead_time_days: int = 0
    cost: float = 0.0


class StaffMember(BaseModel):
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    name: str
    role: str
    available_days: list[str] = Field(default_factory=list)
    skills: list[str] = Field(default_factory=list)


class Substitution(BaseModel):
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    original: str
    substitute: str
    validity: str
    caveats: str = ""


class Paper(BaseModel):
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    title: str
    domain: Literal["math_reasoning", "ml_benchmark", "finance_trading"]
    claim: str
    method_summary: str
    original_sample_size: int
    original_duration_days: int
    original_technique: str
    required_controls: list[str] = Field(default_factory=list)
    required_equipment: list[str] = Field(default_factory=list)
    required_reagents: list[str] = Field(default_factory=list)
    statistical_test: str


class LabConstraints(BaseModel):
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    budget_total: float
    budget_remaining: float
    equipment: list[Equipment] = Field(default_factory=list)
    reagents: list[Reagent] = Field(default_factory=list)
    staff: list[StaffMember] = Field(default_factory=list)
    max_duration_days: int
    safety_rules: list[str] = Field(default_factory=list)
    valid_substitutions: list[Substitution] = Field(default_factory=list)


class MinimumViableSpec(BaseModel):
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    min_sample_size: int
    must_keep_controls: list[str] = Field(default_factory=list)
    acceptable_techniques: list[str] = Field(default_factory=list)
    min_duration_days: int
    critical_equipment: list[str] = Field(default_factory=list)
    flexible_equipment: list[str] = Field(default_factory=list)
    critical_reagents: list[str] = Field(default_factory=list)
    flexible_reagents: list[str] = Field(default_factory=list)
    power_threshold: float


class Scenario(BaseModel):
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    paper: Paper
    lab_constraints: LabConstraints
    minimum_viable_spec: MinimumViableSpec
    difficulty: Difficulty
    narrative_hook: str


class OracleScientistObservation(BaseModel):
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    paper: Paper
    round_number: int
    max_rounds: int
    conversation_history: list[dict] = Field(default_factory=list)
    current_protocol: Optional[dict] = None


class OracleLabManagerObservation(BaseModel):
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    lab_constraints: LabConstraints
    current_protocol: Optional[dict] = None
    scientist_action: ScientistAction
    round_number: int


class LabManagerResponse(BaseModel):
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    response_type: Literal[
        "feasibility_report",
        "suggest_substitution",
        "reject",
        "accept",
    ]
    feasible: bool
    issues: list[str] = Field(default_factory=list)
    suggestions: list[str] = Field(default_factory=list)
    cost_estimate: float = 0.0
    time_estimate_days: int = 0
    message: str


class AdjudicatorRoundScore(BaseModel):
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    rigor_flags: list[str] = Field(default_factory=list)
    feasibility_flags: list[str] = Field(default_factory=list)
    info_gain: float
    protocol_delta: float
    momentum: float
    contradiction_detected: bool
    stalling_detected: bool
    step_reward: float
    notes: str


class AdjudicatorTerminalScore(BaseModel):
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    rigor: float
    feasibility: float
    fidelity: float
    parsimony: float
    robustness: float
    power_preservation: float
    efficiency_bonus: float
    communication_bonus: float
    penalties: dict[str, float] = Field(default_factory=dict)
    terminal_reward: float
    total_reward: float


class EnvironmentEvent(BaseModel):
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    event_type: str
    description: str
    state_changes: dict[str, object] = Field(default_factory=dict)
    severity: Literal["minor", "moderate", "major"]


class PostMortem(BaseModel):
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    overall_summary: str
    rigor_explanation: str
    feasibility_explanation: str
    fidelity_explanation: str
    key_decisions: list[str] = Field(default_factory=list)
    missed_opportunities: list[str] = Field(default_factory=list)
    comparison_note: str


__all__ = [
    "AdjudicatorRoundScore",
    "AdjudicatorTerminalScore",
    "Difficulty",
    "EnvironmentEvent",
    "Equipment",
    "LabConstraints",
    "LabManagerResponse",
    "MinimumViableSpec",
    "OracleLabManagerObservation",
    "OracleScientistObservation",
    "Paper",
    "PostMortem",
    "Reagent",
    "Scenario",
    "StaffMember",
    "Substitution",
]
```
replicalab/prompts/__init__.py
CHANGED (updated excerpts)

```python
"""Prompt template assets and render helpers."""

from __future__ import annotations

# ...

PromptRole = Literal["scientist", "lab_manager", "judge"]

_PROMPTS_DIR = Path(__file__).resolve().parent


def load_prompt_asset(name: str) -> str:
    """Load any prompt asset by filename stem."""

    path = _PROMPTS_DIR / f"{name}.txt"
    return path.read_text(encoding="utf-8")


def load_prompt_template(role: PromptRole) -> str:
    """Load a role prompt template from disk."""

    return load_prompt_asset(role)


# render_prompt_template and the other render helpers are unchanged.

__all__ = [
    "PromptRole",
    "load_prompt_asset",
    "load_prompt_template",
    "render_prompt_template",
    "render_scientist_prompt",
    # ...
]
```
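The `load_prompt_asset` contract above is small enough to sketch end to end. The snippet below mirrors its behavior against a throwaway temp directory (the directory and sample file are illustrative; the real helper reads from the package's `prompts/` directory next to `__init__.py`):

```python
from pathlib import Path
import tempfile

# Stand-in for the package prompts directory (illustrative only).
_PROMPTS_DIR = Path(tempfile.mkdtemp())


def load_prompt_asset(name: str) -> str:
    """Load any prompt asset by filename stem, mirroring prompts/__init__.py."""
    path = _PROMPTS_DIR / f"{name}.txt"
    return path.read_text(encoding="utf-8")


# Write a sample asset, then load it by stem.
(_PROMPTS_DIR / "oracle_lab_manager.txt").write_text(
    "You are a Lab Manager.", encoding="utf-8"
)
print(load_prompt_asset("oracle_lab_manager"))  # You are a Lab Manager.
```

Because `load_prompt_template(role)` now delegates to `load_prompt_asset(role)`, role templates and Oracle prompt assets share one loading path.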
replicalab/prompts/oracle_adjudicator.txt
ADDED

```
You are the Dynamic Adjudicator for ReplicaLab.

You evaluate each round of negotiation and can also produce a terminal
summary score object. Be precise, fair, and consistent.

Round scoring:
- info_gain (0-1): how much new useful information the Scientist extracted
- protocol_delta (-1 to 1): did the protocol move closer to or further from a viable plan
- momentum (0-1): did the Scientist respond productively to feedback
- contradiction_detected: did the Scientist contradict previously revealed constraints
- stalling_detected: did the Scientist repeat prior actions or already-answered questions
- step_reward: combine the above into a small shaped score

Terminal scoring:
- rigor, feasibility, fidelity, parsimony, robustness, power_preservation
- efficiency_bonus and communication_bonus
- penalties with named keys only
- terminal_reward and total_reward

Important:
- Do not invent new score dimensions outside the schema.
- Score against the hidden scenario specification, not your personal preference.
- Reward fields must be numerically coherent and self-consistent.

Respond ONLY with valid JSON matching the requested adjudicator schema.
No markdown. No explanation. No extra text.
```
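The prompt leaves the exact shaping of `step_reward` to the adjudicator model itself. Purely as an illustration of how the round signals could combine (the weights and penalty sizes below are invented for this sketch and are not part of the project):

```python
# Hypothetical shaping formula: the adjudicator LLM emits step_reward itself;
# this only illustrates one plausible combination of the round signals.
def shaped_step_reward(
    info_gain: float,
    protocol_delta: float,
    momentum: float,
    contradiction_detected: bool,
    stalling_detected: bool,
) -> float:
    reward = 0.4 * info_gain + 0.4 * protocol_delta + 0.2 * momentum
    if contradiction_detected:
        reward -= 0.25  # invented penalty size
    if stalling_detected:
        reward -= 0.15  # invented penalty size
    return round(reward, 3)


print(shaped_step_reward(0.8, 0.5, 0.6, False, True))  # 0.49
```

A deterministic cross-check like this can also serve as a sanity bound on what the adjudicator returns.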
replicalab/prompts/oracle_event_injector.txt
ADDED

```
You are the Event Injector for ReplicaLab.

After a negotiation round, decide whether to inject a realistic mid-episode
perturbation. Inject sparingly.

Rules:
- Never inject more than one event per episode.
- Never inject in rounds 1 or 2.
- Only inject if the negotiation is stagnating or the protocol is too comfortable.
- Events must be survivable. There must remain a path to a decent outcome.
- Use realistic events only: budget cuts, equipment failure, maintenance, backorders, scope changes, staff unavailability.
- state_changes must be a flat dictionary of dotted paths to new values.

If no event is needed, respond with:
{"inject": false}

If injecting an event, respond with:
{"inject": true, "event": <EnvironmentEvent JSON>}

No markdown. No explanation. No extra text.
```
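The two reply shapes and the dotted-path `state_changes` convention can be exercised with a short stdlib sketch. Both helpers below are hypothetical (the diff does not show the project's own parser); only the JSON shapes come from the prompt:

```python
import json

# Hypothetical helpers illustrating the Event Injector contract: a reply is
# either {"inject": false} or {"inject": true, "event": {...}}, where
# event["state_changes"] maps dotted paths to new values.
def parse_injector_reply(raw: str):
    data = json.loads(raw)
    return data["event"] if data.get("inject") else None


def apply_state_changes(state: dict, changes: dict) -> dict:
    for path, value in changes.items():
        node = state
        *parents, leaf = path.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return state


state = {"lab_constraints": {"budget_remaining": 50000.0}}
event = parse_injector_reply(
    '{"inject": true, "event": {"event_type": "budget_cut", '
    '"description": "Funding trimmed.", "severity": "moderate", '
    '"state_changes": {"lab_constraints.budget_remaining": 35000.0}}}'
)
apply_state_changes(state, event["state_changes"])
print(state["lab_constraints"]["budget_remaining"])  # 35000.0
```

Keeping `state_changes` flat makes events trivially auditable: each key names exactly one field the event touched.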
replicalab/prompts/oracle_lab_manager.txt
ADDED

```
You are a Lab Manager at a research institution. You are practical,
detail-oriented, and protective of your lab's resources.

You have access to a constraint document that describes your lab's exact
situation: budget, equipment, reagents, staff, bookings, and safety rules.
This document is ground truth. Do not invent constraints that are not in it,
and do not ignore constraints that are.

When a Scientist proposes a protocol or asks a question:
1. Check every element against your constraints.
2. Report what is feasible and what is not.
3. If something is not feasible, suggest a concrete alternative if one exists.
4. Estimate the cost and time for what is proposed.
5. Be collaborative but honest. Do not agree to things the lab cannot do.

Respond ONLY with valid JSON matching LabManagerResponse.
No markdown. No preamble. No extra text.
```
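Before handing a raw reply to full pydantic validation, a cheap stdlib shape check can reject obviously malformed output early. The function below is an illustration, not the project's validator; the allowed `response_type` values and required fields come from `LabManagerResponse` in `oracle_models.py`:

```python
import json

# Allowed response_type values, copied from the LabManagerResponse Literal.
ALLOWED_TYPES = {"feasibility_report", "suggest_substitution", "reject", "accept"}


def looks_like_lab_manager_response(raw: str) -> bool:
    """Cheap pre-check that a raw reply matches the LabManagerResponse shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        data.get("response_type") in ALLOWED_TYPES
        and isinstance(data.get("feasible"), bool)
        and isinstance(data.get("message"), str)
    )


ok = '{"response_type": "accept", "feasible": true, "message": "Approved."}'
print(looks_like_lab_manager_response(ok))          # True
print(looks_like_lab_manager_response("not json"))  # False
```

The strict "JSON only, no markdown" instruction in the prompt exists precisely so checks like this (and the real `extra="forbid"` pydantic validation) can run without stripping code fences first.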
replicalab/prompts/oracle_post_mortem.txt
ADDED

```
You are the Post-Mortem Analyst for ReplicaLab.

At episode end, explain the outcome clearly and specifically.

Your explanation must:
- summarize the episode in 2-3 sentences
- explain rigor, feasibility, and fidelity using concrete choices from the protocol
- identify 3 to 5 impactful decisions
- list missed opportunities
- compare the final protocol to what an optimal Scientist would likely have done

Be specific and evidence-based. Refer to protocol decisions, constraints, and final scores.

Respond ONLY with valid JSON matching the PostMortem schema.
No markdown. No explanation. No extra text.
```
replicalab/prompts/oracle_world_architect.txt
ADDED

```
You are the World Architect for ReplicaLab.

You generate a complete, internally consistent scenario for one episode.
You receive a seed, difficulty level, and domain.

You must produce:
1. A realistic research or benchmark paper/specification with a clear claim
2. Lab or compute constraints that create real tension with the requirements
3. A hidden minimum viable replication spec that is achievable under the constraints
4. Valid substitutions that are scientifically or operationally defensible

Rules:
- Supported domains are math_reasoning, ml_benchmark, and finance_trading.
- The scenario must be solvable. There must always be a viable path to a reasonable outcome.
- Difficulty controls conflict density:
  easy = 1-2 meaningful conflicts
  medium = 3-4 meaningful conflicts
  hard = 5 or more meaningful conflicts
- Budget, duration, staff skills, and substitutions must be realistic for the domain.
- Generate a short narrative_hook that helps the UI explain why this scenario is interesting.

Respond ONLY with valid JSON matching the Scenario schema.
No markdown. No explanation. No extra text.
```
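The conflict-density bands are simple enough to check mechanically when auditing generated scenarios. This validator is a sketch of the rule stated in the prompt, not code from the repo (how "meaningful conflicts" are counted is left as an assumption):

```python
# Bands copied from the World Architect prompt: easy 1-2, medium 3-4, hard 5+.
BANDS = {"easy": (1, 2), "medium": (3, 4), "hard": (5, None)}


def conflict_count_ok(difficulty: str, conflicts: int) -> bool:
    """Check a scenario's conflict count against its difficulty band."""
    low, high = BANDS[difficulty]
    return conflicts >= low and (high is None or conflicts <= high)


print(conflict_count_ok("medium", 3))  # True
print(conflict_count_ok("hard", 4))    # False
```

A check like this pairs naturally with the "must be solvable" rule: reject scenarios whose conflict count falls outside the band before they ever reach an episode.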
replicalab/scenarios/__init__.py
CHANGED (updated excerpts)

```python
from .templates import (
    apply_difficulty,
    generate_scenario,
    load_template,
    oracle_scenario_to_normalized_pack,
)

__all__ = [
    # ...
    "apply_difficulty",
    "generate_scenario",
    "load_template",
    "oracle_scenario_to_normalized_pack",
]
```
replicalab/scenarios/templates.py
CHANGED (updated excerpts)

```python
from replicalab.config import MAX_BUDGET, MAX_ROUNDS
from replicalab.models import LabManagerObservation, ScientistObservation
from replicalab.oracle_models import Scenario as OracleScenario
from replicalab.scenarios.finance_trading import build_finance_trading_template
from replicalab.scenarios.math_reasoning import build_math_reasoning_template
from replicalab.scenarios.ml_benchmark import build_ml_benchmark_template

# ...


def oracle_scenario_to_normalized_pack(
    *,
    seed: int,
    template: TemplateName,
    oracle_scenario: OracleScenario,
    max_rounds: int = MAX_ROUNDS,
) -> NormalizedScenarioPack:
    """Adapt an Oracle-generated Scenario into the canonical normalized pack."""

    difficulty = oracle_scenario.difficulty.value
    budget_total = oracle_scenario.lab_constraints.budget_total
    budget_remaining = oracle_scenario.lab_constraints.budget_remaining
    time_limit_days = oracle_scenario.lab_constraints.max_duration_days
    staff_count = len(oracle_scenario.lab_constraints.staff)

    constraints: list[ScenarioConstraint] = [
        ScenarioConstraint(
            key="budget_total",
            label="Budget total",
            quantity=budget_total,
            unit="USD",
            comparator="<=",
            hard=True,
            details=f"Total available budget is {budget_total:.2f} USD.",
        ),
        ScenarioConstraint(
            key="budget_remaining",
            label="Budget remaining",
            quantity=budget_remaining,
            unit="USD",
            comparator="<=",
            hard=True,
            details=f"Remaining budget at episode start is {budget_remaining:.2f} USD.",
        ),
        ScenarioConstraint(
            key="max_duration_days",
            label="Maximum duration",
            quantity=time_limit_days,
            unit="days",
            comparator="<=",
            hard=True,
            details=f"The plan must finish within {time_limit_days} days.",
        ),
        ScenarioConstraint(
            key="staff_count",
            label="Available staff",
            quantity=staff_count,
            unit="people",
            comparator=">=",
            hard=True,
            details=f"{staff_count} staff member(s) are available for this scenario.",
        ),
    ]
    constraints.extend(
        ScenarioConstraint(
            key=f"safety_rule_{index + 1}",
            label=f"Safety rule {index + 1}",
            comparator="=",
            hard=True,
            details=rule,
        )
        for index, rule in enumerate(oracle_scenario.lab_constraints.safety_rules)
    )

    resources: list[ScenarioResource] = []
    for equipment in oracle_scenario.lab_constraints.equipment:
        category = (
            "compute"
            if any(token in equipment.name.lower() for token in ("gpu", "cluster", "accelerator"))
            else "tool"
        )
        resources.append(
            ScenarioResource(
                key=_slug(equipment.name),
                label=equipment.name,
                quantity=1,
                unit="unit",
                available=equipment.available and equipment.condition != "shared_booking",
                category=category,
                details=(
                    f"Condition: {equipment.condition}. "
                    f"Booking conflicts: {', '.join(equipment.booking_conflicts) or 'none'}."
                ),
            )
        )

    for reagent in oracle_scenario.lab_constraints.reagents:
        resources.append(
            ScenarioResource(
                key=_slug(reagent.name),
                label=reagent.name,
                quantity=reagent.quantity_available,
                unit=reagent.unit,
                available=reagent.in_stock,
                category="reference",
                details=(
                    f"Lead time: {reagent.lead_time_days} day(s). "
                    f"Unit cost: {reagent.cost:.2f}."
                ),
            )
        )

    for member in oracle_scenario.lab_constraints.staff:
        resources.append(
            ScenarioResource(
                key=_slug(member.name),
                label=member.name,
                quantity=len(member.available_days),
                unit="days",
                available=bool(member.available_days),
                category="personnel",
                details=f"Role: {member.role}. Skills: {', '.join(member.skills) or 'generalist'}.",
            )
        )

    substitutions = [
        AllowedSubstitution(
            original=item.original,
            alternative=item.substitute,
            condition=item.validity,
            tradeoff=item.caveats or item.validity,
        )
        for item in oracle_scenario.lab_constraints.valid_substitutions
    ]

    required_elements = (
        list(oracle_scenario.minimum_viable_spec.must_keep_controls)
        + list(oracle_scenario.minimum_viable_spec.critical_equipment)
        + list(oracle_scenario.minimum_viable_spec.critical_reagents)
    )
    flexible_elements = (
        list(oracle_scenario.minimum_viable_spec.acceptable_techniques)
        + list(oracle_scenario.minimum_viable_spec.flexible_equipment)
        + list(oracle_scenario.minimum_viable_spec.flexible_reagents)
    )

    hidden_reference = HiddenReferenceSpec(
        summary=oracle_scenario.paper.method_summary,
        required_elements=required_elements,
        flexible_elements=flexible_elements,
        target_metric=oracle_scenario.paper.statistical_test,
        target_value=f"power>={oracle_scenario.minimum_viable_spec.power_threshold:.2f}",
    )

    success_criteria = [
        oracle_scenario.paper.claim,
        f"Preserve controls: {', '.join(oracle_scenario.paper.required_controls) or 'none listed'}",
        "Use an acceptable technique from the viable spec where possible.",
        f"Stay within {budget_total:.2f} USD and {time_limit_days} days.",
    ]

    equipment_available = [
        equipment.name
        for equipment in oracle_scenario.lab_constraints.equipment
        if equipment.available and equipment.condition != "shared_booking"
    ]
    equipment_booked = [
        equipment.name
        for equipment in oracle_scenario.lab_constraints.equipment
        if not equipment.available or equipment.condition == "shared_booking"
    ]
    reagents_in_stock = [
        reagent.name for reagent in oracle_scenario.lab_constraints.reagents if reagent.in_stock
    ]
    reagents_out_of_stock = [
        reagent.name for reagent in oracle_scenario.lab_constraints.reagents if not reagent.in_stock
    ]

    scientist_observation = ScientistObservation(
        paper_title=oracle_scenario.paper.title,
        paper_hypothesis=oracle_scenario.paper.claim,
        paper_method=oracle_scenario.paper.method_summary,
        paper_key_finding=oracle_scenario.narrative_hook,
        experiment_goal=oracle_scenario.paper.claim,
        conversation_history=[],
        current_protocol=None,
        round_number=0,
        max_rounds=max_rounds,
    )

    lab_manager_observation = LabManagerObservation(
        budget_total=budget_total,
        budget_remaining=budget_remaining,
        equipment_available=equipment_available,
        equipment_booked=equipment_booked,
        reagents_in_stock=reagents_in_stock,
        reagents_out_of_stock=reagents_out_of_stock,
        staff_count=staff_count,
        time_limit_days=time_limit_days,
        safety_restrictions=list(oracle_scenario.lab_constraints.safety_rules),
        conversation_history=[],
        current_protocol=None,
        round_number=0,
        max_rounds=max_rounds,
    )

    bookings = _oracle_bookings(oracle_scenario)
    windows = _oracle_windows(oracle_scenario)

    return NormalizedScenarioPack(
        scenario_id=f"{template}-{difficulty}-{seed}-oracle",
        template=template,
        domain_id=oracle_scenario.paper.domain,
        difficulty=difficulty,
        seed=seed,
        task_summary=oracle_scenario.paper.claim,
        success_criteria=success_criteria,
        constraints=constraints,
        resources=resources,
        allowed_substitutions=substitutions,
        hidden_reference_spec=hidden_reference,
        scientist_observation=scientist_observation,
        lab_manager_observation=lab_manager_observation,
        resource_bookings=bookings,
        scheduling_windows=windows,
    )


def _slug(value: str) -> str:
    return "_".join(value.lower().replace("/", " ").replace("-", " ").split())


def _day_to_offset(day: str) -> int:
    mapping = {
        "monday": 0,
        "tuesday": 24,
        "wednesday": 48,
        "thursday": 72,
        "friday": 96,
        "saturday": 120,
        "sunday": 144,
    }
    return mapping.get(day.strip().lower(), 0)


def _oracle_bookings(oracle_scenario: OracleScenario) -> list[ResourceBooking]:
    bookings: list[ResourceBooking] = []
    for equipment in oracle_scenario.lab_constraints.equipment:
        if equipment.booking_conflicts:
            for day in equipment.booking_conflicts:
                bookings.append(
                    ResourceBooking(
                        resource_key=_slug(equipment.name),
                        resource_label=equipment.name,
                        slot_label=day,
                        start_offset_hours=_day_to_offset(day),
                        duration_hours=8.0,
                        status="booked" if equipment.available else "maintenance",
                        details=f"{equipment.name} is constrained on {day}.",
                    )
                )
        else:
            bookings.append(
                ResourceBooking(
                    resource_key=_slug(equipment.name),
                    resource_label=equipment.name,
                    slot_label="default",
                    start_offset_hours=0.0,
                    duration_hours=8.0,
                    status="available" if equipment.available else "maintenance",
                    details=f"{equipment.name} is available under normal scheduling.",
                )
            )
    return bookings


def _oracle_windows(oracle_scenario: OracleScenario) -> list[SchedulingWindow]:
    windows: list[SchedulingWindow] = [
        SchedulingWindow(
            key="max_duration_window",
            label="Maximum project duration",
            start_offset_hours=0.0,
            end_offset_hours=float(oracle_scenario.lab_constraints.max_duration_days * 24),
            hard=True,
            details=(
                f"All work must complete within "
                f"{oracle_scenario.lab_constraints.max_duration_days} days."
            ),
        )
    ]

    seen_days: set[str] = set()
    for member in oracle_scenario.lab_constraints.staff:
        for day in member.available_days:
            normalized = day.strip().lower()
            if normalized in seen_days:
                continue
            seen_days.add(normalized)
            start = float(_day_to_offset(day))
            windows.append(
                SchedulingWindow(
                    key=f"staff_{normalized}",
                    label=f"Staff availability {day}",
                    start_offset_hours=start,
                    end_offset_hours=start + 8.0,
                    hard=False,
                    details=f"At least one staff member is available on {day}.",
                )
            )
    return windows


# _build_pack and _split_resources are unchanged.
```
replicalab/scoring/explain.py
CHANGED
@@ -1,6 +1,6 @@
-"""JDG 06
+"""JDG 06 - Plain-English explanation builder from RewardBreakdown.
 
-Pure deterministic function
+Pure deterministic function - reads existing breakdown fields only,
 introduces no new scoring logic.
 """
 
@@ -22,33 +22,31 @@ def _tier(score: float) -> str:
 
 
 def explain_reward(breakdown: RewardBreakdown) -> str:
-    """Build a plain-English explanation from a RewardBreakdown.
-
-    The output mirrors the three rubric components (rigor, feasibility,
-    fidelity), any bonuses, any named penalties, and the final total.
-    No hidden scoring logic is introduced — this is a pure formatter.
-    """
+    """Build a plain-English explanation from a RewardBreakdown."""
     total = compute_total_reward(breakdown)
     lines: list[str] = []
 
-    # --- rubric components ---
     lines.append(
-        f"Rigor: {breakdown.rigor:.2f} ({_tier(breakdown.rigor)})
+        f"Rigor: {breakdown.rigor:.2f} ({_tier(breakdown.rigor)}) - "
         "measures structural completeness, success-criteria coverage, "
         "and required-element coverage."
     )
     lines.append(
-        f"Feasibility: {breakdown.feasibility:.2f} ({_tier(breakdown.feasibility)})
+        f"Feasibility: {breakdown.feasibility:.2f} ({_tier(breakdown.feasibility)}) - "
        "measures whether the protocol respects budget, equipment, reagent, "
        "schedule, and staffing constraints."
    )
    lines.append(
-        f"Fidelity: {breakdown.fidelity:.2f} ({_tier(breakdown.fidelity)})
+        f"Fidelity: {breakdown.fidelity:.2f} ({_tier(breakdown.fidelity)}) - "
        "measures alignment with the hidden reference spec, including "
        "required elements, substitutions, and target metrics."
    )
+    lines.append(
+        f"Parsimony: {breakdown.parsimony:.2f} ({_tier(breakdown.parsimony)}) - "
+        "measures whether the plan stays lean instead of requesting more "
+        "controls, equipment, or reagents than the scenario complexity calls for."
+    )
 
-    # --- bonuses ---
     if breakdown.efficiency_bonus > 0:
         lines.append(
             f"Efficiency bonus: +{breakdown.efficiency_bonus:.2f} "
@@ -59,15 +57,17 @@ def explain_reward(breakdown: RewardBreakdown) -> str:
             f"Communication bonus: +{breakdown.communication_bonus:.2f}."
         )
 
-    # --- penalties ---
     if breakdown.penalties:
         for key, amount in sorted(breakdown.penalties.items()):
             label = key.replace("_", " ")
-            lines.append(f"Penalty
+            lines.append(f"Penalty - {label}: -{amount:.2f}.")
     else:
         lines.append("No penalties applied.")
 
-
-
+    lines.append(
+        "Total reward: "
+        f"{total:.2f} "
+        "(formula: 10 x rigor x feasibility x fidelity x parsimony + bonuses - penalties)."
+    )
 
     return "\n".join(lines)
replicalab/scoring/rubric.py
CHANGED
@@ -1,11 +1,14 @@
-"""JDG 04-05
+"""JDG 04-05 - Total reward computation and reward breakdown builder.
 
-Combines rigor (JDG 01), feasibility (JDG 02),
-into a single scalar reward with
+Combines rigor (JDG 01), feasibility (JDG 02), fidelity (JDG 03), and a
+lightweight parsimony term into a single scalar reward with bonuses and
+named penalties.
 
-Formula:
+Formula:
+    total = 10 * rigor * feasibility * fidelity * parsimony
+            + bonuses - penalties
 
-Pure deterministic functions
+Pure deterministic functions - no model calls, no side effects.
 """
 
 from __future__ import annotations
@@ -27,12 +30,14 @@ _MAX_COMMUNICATION_BONUS = 0.0  # reserved for future use
 
 
 def compute_total_reward(breakdown: RewardBreakdown) -> float:
-    """Compute the scalar reward from a RewardBreakdown.
-
-
-
-
-
+    """Compute the scalar reward from a RewardBreakdown."""
+    base = (
+        _REWARD_SCALE
+        * breakdown.rigor
+        * breakdown.feasibility
+        * breakdown.fidelity
+        * breakdown.parsimony
+    )
     bonus = breakdown.efficiency_bonus + breakdown.communication_bonus
     penalty = sum(breakdown.penalties.values())
     return max(0.0, round(base + bonus - penalty, 6))
@@ -47,32 +52,14 @@ def build_reward_breakdown(
     check: FeasibilityCheckResult | None = None,
     penalties: dict[str, float] | None = None,
 ) -> RewardBreakdown:
-    """Build a full RewardBreakdown from the
-
-    Parameters
-    ----------
-    protocol : Protocol
-        The final agreed protocol.
-    scenario : NormalizedScenarioPack
-        The scenario pack for this episode.
-    rounds_used : int
-        How many rounds were consumed.
-    max_rounds : int
-        The episode's round cap.
-    check : FeasibilityCheckResult, optional
-        Pre-computed feasibility check to avoid redundant work.
-    penalties : dict[str, float], optional
-        Named penalty keys for bounded-tool diagnostics, unsupported
-        evidence claims, or other deterministic deductions. Use named
-        keys (e.g. ``"invalid_tool_use"``, ``"unsupported_claim"``)
-        instead of adding new fields to RewardBreakdown.
-    """
+    """Build a full RewardBreakdown from the sub-scores plus bonuses."""
     if check is None:
         check = check_feasibility(protocol, scenario)
 
     rigor = score_rigor(protocol, scenario)
     feasibility = score_feasibility(protocol, scenario, check=check)
     fidelity = score_fidelity(protocol, scenario)
+    parsimony = _score_parsimony(protocol, scenario)
 
     efficiency_bonus = _efficiency_bonus(rounds_used, max_rounds)
     merged_penalties = dict(penalties) if penalties else {}
@@ -81,6 +68,7 @@ def build_reward_breakdown(
         rigor=rigor,
         feasibility=feasibility,
         fidelity=fidelity,
+        parsimony=parsimony,
         efficiency_bonus=efficiency_bonus,
         communication_bonus=0.0,
         penalties=merged_penalties,
@@ -88,12 +76,33 @@ def build_reward_breakdown(
 
 
 def _efficiency_bonus(rounds_used: int, max_rounds: int) -> float:
-    """Reward finishing in fewer rounds.
-
-    If the scientist reaches agreement in round 1 of 6, that's the maximum
-    bonus. If they use all rounds, the bonus is 0.
-    """
+    """Reward finishing in fewer rounds."""
     if max_rounds <= 1 or rounds_used <= 0:
         return 0.0
     saved = max(0, max_rounds - rounds_used)
     return round(_MAX_EFFICIENCY_BONUS * saved / (max_rounds - 1), 6)
+
+
+def _score_parsimony(
+    protocol: Protocol,
+    scenario: NormalizedScenarioPack,
+) -> float:
+    """Score how lean the protocol is relative to scenario complexity.
+
+    The current scenario schema does not expose explicit "necessary resource"
+    labels, so we infer complexity from the hidden required-element count and
+    penalize plans that request far more unique controls/resources than that
+    complexity suggests.
+    """
+    required_element_count = len(scenario.hidden_reference_spec.required_elements)
+    complexity_budget = max(2, required_element_count + 2)
+    requested_count = (
+        len(set(protocol.controls))
+        + len(set(protocol.required_equipment))
+        + len(set(protocol.required_reagents))
+    )
+    if requested_count <= 0:
+        return 1.0
+
+    ratio = complexity_budget / max(complexity_budget, requested_count)
+    return round(max(0.25, min(1.0, ratio)), 6)
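The new multiplicative formula can be sanity-checked in isolation. The sketch below mirrors `compute_total_reward` without the project's models, assuming `_REWARD_SCALE` is 10 as the updated docstring states; note that any sub-score of zero gates the whole base reward to zero, and the `max(0.0, ...)` clamp keeps large penalties from producing negative totals:

```python
# Standalone check of the rubric formula:
#   total = 10 * rigor * feasibility * fidelity * parsimony + bonuses - penalties
# (assumes _REWARD_SCALE == 10, per the updated docstring).
def total_reward(rigor, feasibility, fidelity, parsimony, bonus=0.0, penalty=0.0):
    base = 10 * rigor * feasibility * fidelity * parsimony
    # Clamp at zero and round to 6 decimals, as compute_total_reward does.
    return max(0.0, round(base + bonus - penalty, 6))


# A strong episode: high sub-scores, a small efficiency bonus, one penalty.
print(total_reward(0.9, 0.8, 0.85, 0.9, bonus=0.3, penalty=0.1))  # 5.708
```

Because the base term is multiplicative, `total_reward(1.0, 1.0, 0.0, 1.0)` is `0.0` regardless of the other sub-scores, which is the intended gating behavior.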
replicalab/training/rollout.py
CHANGED
@@ -164,9 +164,9 @@ class RolloutWorker:
         )
         record.steps.append(step)
         record.tool_traces.extend(tool_traces)
+        record.total_reward = round(record.total_reward + result.reward, 6)
 
         if result.done:
-            record.total_reward = result.reward
             record.reward_breakdown = result.info.reward_breakdown
             record.judge_notes = result.info.judge_notes
             record.verdict = result.info.verdict
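The rollout change above switches from overwriting `record.total_reward` at episode end to accumulating each step's reward, rounding to 6 decimals every step. An illustrative sketch of that accumulation pattern (not the repo's code) shows why the per-step rounding matters for logged totals:

```python
# Illustrative sketch of the accumulation pattern adopted above: add each
# step's reward to a running total and round to 6 decimals every step, so
# binary floating-point residue never shows up in logged totals.
def accumulate(step_rewards):
    total = 0.0
    for reward in step_rewards:
        total = round(total + reward, 6)
    return total


print(accumulate([0.1] * 10))  # rounding keeps this at exactly 1.0
print(sum([0.1] * 10))         # naive summation drifts: 0.9999999999999999
```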
server/app.py
CHANGED
@@ -111,7 +111,7 @@ def _build_episode_log(
         final_state=state,
         transcript=list(state.conversation_history),
         reward_breakdown=info.reward_breakdown,
-        total_reward=
+        total_reward=state.reward,
         rounds_used=state.round_number,
         agreement_reached=info.agreement_reached,
         judge_notes=info.judge_notes or "",
tests/test_cache.py
ADDED
@@ -0,0 +1,92 @@
+from __future__ import annotations
+
+import json
+
+from replicalab.cache import CachedOracle, ScenarioCache
+from replicalab.oracle_models import Scenario
+
+
+def _scenario_payload() -> dict:
+    return {
+        "paper": {
+            "title": "Cached benchmark",
+            "domain": "ml_benchmark",
+            "claim": "A small run remains useful under a tighter budget.",
+            "method_summary": "Train a compact model and verify against a held-out split.",
+            "original_sample_size": 1000,
+            "original_duration_days": 2,
+            "original_technique": "compact_model",
+            "required_controls": ["baseline"],
+            "required_equipment": ["GPU cluster"],
+            "required_reagents": ["dataset snapshot"],
+            "statistical_test": "accuracy_gap",
+        },
+        "lab_constraints": {
+            "budget_total": 1200.0,
+            "budget_remaining": 1200.0,
+            "equipment": [
+                {
+                    "name": "GPU cluster",
+                    "available": True,
+                    "condition": "operational",
+                    "booking_conflicts": [],
+                    "cost_per_use": 100.0,
+                }
+            ],
+            "reagents": [
+                {
+                    "name": "dataset snapshot",
+                    "in_stock": True,
+                    "quantity_available": 1.0,
+                    "unit": "copy",
+                    "lead_time_days": 0,
+                    "cost": 0.0,
+                }
+            ],
+            "staff": [],
+            "max_duration_days": 3,
+            "safety_rules": ["No external internet."],
+            "valid_substitutions": [],
+        },
+        "minimum_viable_spec": {
+            "min_sample_size": 800,
+            "must_keep_controls": ["baseline"],
+            "acceptable_techniques": ["compact_model"],
+            "min_duration_days": 1,
+            "critical_equipment": ["GPU cluster"],
+            "flexible_equipment": [],
+            "critical_reagents": ["dataset snapshot"],
+            "flexible_reagents": [],
+            "power_threshold": 0.75,
+        },
+        "difficulty": "easy",
+        "narrative_hook": "The benchmark owners tightened the reporting budget.",
+    }
+
+
+def test_scenario_cache_round_trips(tmp_path) -> None:
+    cache = ScenarioCache(tmp_path)
+    scenario = Scenario.model_validate(_scenario_payload())
+
+    path = cache.put(13, "easy", "ml_benchmark", scenario)
+    restored = cache.get(13, "easy", "ml_benchmark")
+
+    assert path.exists()
+    assert restored is not None
+    assert restored.model_dump(mode="json") == scenario.model_dump(mode="json")
+
+
+def test_cached_oracle_uses_cache_after_first_generation(tmp_path) -> None:
+    calls = {"count": 0}
+
+    def fake_client(system: str, user: str, model: str) -> str:
+        calls["count"] += 1
+        return json.dumps(_scenario_payload())
+
+    oracle = CachedOracle(fake_client, cache=ScenarioCache(tmp_path))
+
+    first = oracle.generate_scenario(9, "easy", "ml_benchmark")
+    second = oracle.generate_scenario(9, "easy", "ml_benchmark")
+
+    assert first.model_dump(mode="json") == second.model_dump(mode="json")
+    assert calls["count"] == 1
tests/test_env.py
CHANGED
@@ -342,7 +342,9 @@ class TestStep:
 
         assert env.state().round_number == 1
         assert result.done is False
-        assert result.reward
+        assert result.reward > 0.0
+        assert result.info.step_reward_components["protocol_delta_bonus"] > 0.0
+        assert result.info.cumulative_reward == result.reward
 
     def test_step_returns_observations(self) -> None:
         env = ReplicaLabEnv()
@@ -416,7 +418,9 @@ class TestStep:
 
         assert result.done is True
         assert result.info.agreement_reached is False
-        assert result.reward
+        assert result.reward < 0.0
+        assert result.info.reward_breakdown is not None
+        assert result.info.reward_breakdown.penalties["timeout"] > 0.0
 
     def test_step_info_has_round_and_episode_id(self) -> None:
         env = ReplicaLabEnv()
@@ -655,7 +659,8 @@ class TestEnvReward:
         assert result.done
         assert result.info.verdict == "timeout"
         assert result.info.reward_breakdown is not None
-        assert result.reward
+        assert result.reward < 0.0
+        assert result.info.reward_breakdown.penalties["timeout"] > 0.0
 
     def test_episode_state_stores_final_scores(self) -> None:
         env = ReplicaLabEnv()
tests/test_oracle.py
ADDED
@@ -0,0 +1,281 @@
+from __future__ import annotations
+
+import json
+
+from replicalab.agents.lab_manager_agent import LabManagerAgent
+from replicalab.env import ReplicaLabEnv
+from replicalab.models import ScientistAction
+from replicalab.oracle import Oracle
+from replicalab.oracle_models import (
+    AdjudicatorRoundScore,
+    EnvironmentEvent,
+    OracleLabManagerObservation,
+    PostMortem,
+    Scenario,
+)
+
+
+def _scenario_payload() -> dict:
+    return {
+        "paper": {
+            "title": "Reproducing a Small Vision Benchmark",
+            "domain": "ml_benchmark",
+            "claim": "A compact model can recover >90% of reference accuracy under budget.",
+            "method_summary": "Train a compact CNN with fixed augmentations and evaluate on a held-out split.",
+            "original_sample_size": 1200,
+            "original_duration_days": 3,
+            "original_technique": "compact_cnn",
+            "required_controls": ["seed_control", "baseline_model"],
+            "required_equipment": ["GPU cluster", "validation server"],
+            "required_reagents": ["dataset snapshot"],
+            "statistical_test": "accuracy_gap",
+        },
+        "lab_constraints": {
+            "budget_total": 2400.0,
+            "budget_remaining": 2400.0,
+            "equipment": [
+                {
+                    "name": "GPU cluster",
+                    "available": True,
+                    "condition": "shared_booking",
+                    "booking_conflicts": ["Monday"],
+                    "cost_per_use": 250.0,
+                },
+                {
+                    "name": "Validation server",
+                    "available": True,
+                    "condition": "operational",
+                    "booking_conflicts": [],
+                    "cost_per_use": 20.0,
+                },
+            ],
+            "reagents": [
+                {
+                    "name": "dataset snapshot",
+                    "in_stock": True,
+                    "quantity_available": 1.0,
+                    "unit": "copy",
+                    "lead_time_days": 0,
+                    "cost": 0.0,
+                }
+            ],
+            "staff": [
+                {
+                    "name": "Alex",
+                    "role": "engineer",
+                    "available_days": ["Monday", "Tuesday"],
+                    "skills": ["training", "evaluation"],
+                }
+            ],
+            "max_duration_days": 5,
+            "safety_rules": ["No external internet during training."],
+            "valid_substitutions": [
+                {
+                    "original": "GPU cluster",
+                    "substitute": "single high-memory GPU",
+                    "validity": "acceptable_with_caveats",
+                    "caveats": "Lower throughput is acceptable if evaluation fidelity is preserved.",
+                }
+            ],
+        },
+        "minimum_viable_spec": {
+            "min_sample_size": 800,
+            "must_keep_controls": ["seed_control", "baseline_model"],
+            "acceptable_techniques": ["compact_cnn", "distilled_cnn"],
+            "min_duration_days": 2,
+            "critical_equipment": ["Validation server"],
+            "flexible_equipment": ["GPU cluster"],
+            "critical_reagents": ["dataset snapshot"],
+            "flexible_reagents": [],
+            "power_threshold": 0.8,
+        },
+        "difficulty": "medium",
+        "narrative_hook": "The compute team just reduced your preferred GPU window.",
+    }
+
+
+def _round_score_payload() -> dict:
+    return {
+        "rigor_flags": ["kept baseline_model"],
+        "feasibility_flags": ["GPU window narrowed"],
+        "info_gain": 0.6,
+        "protocol_delta": 0.4,
+        "momentum": 0.7,
+        "contradiction_detected": False,
+        "stalling_detected": False,
+        "step_reward": 0.55,
+        "notes": "Scientist asked a useful scheduling question and preserved controls.",
+    }
+
+
+def _post_mortem_payload() -> dict:
+    return {
+        "overall_summary": "The Scientist converged on a feasible compact CNN plan.",
+        "rigor_explanation": "Controls and the validation server were preserved.",
+        "feasibility_explanation": "The final plan fit the available compute and duration window.",
+        "fidelity_explanation": "The protocol stayed close to the benchmark setup.",
+        "key_decisions": ["Kept seed control", "Accepted lower-throughput compute"],
+        "missed_opportunities": ["Could have asked about booking conflicts earlier"],
+        "comparison_note": "An optimal Scientist would have requested the alternate GPU window one round sooner.",
+    }
+
+
+class _FakeMessagesAPI:
+    def __init__(self, payloads: list[dict]) -> None:
+        self._payloads = payloads
+        self.calls = 0
+
+    def create(self, **_: object):
+        payload = self._payloads[self.calls]
+        self.calls += 1
+
+        class _Chunk:
+            def __init__(self, text: str) -> None:
+                self.text = text
+
+        class _Response:
+            def __init__(self, text: str) -> None:
+                self.content = [_Chunk(text)]
+
+        return _Response(json.dumps(payload))
+
+
+class _FakeClient:
+    def __init__(self, payloads: list[dict]) -> None:
+        self.messages = _FakeMessagesAPI(payloads)
+
+
+def test_oracle_generate_scenario_parses_json() -> None:
+    oracle = Oracle(_FakeClient([_scenario_payload()]))
+
+    scenario = oracle.generate_scenario(seed=7, difficulty="medium", domain="ml_benchmark")
+
+    assert isinstance(scenario, Scenario)
+    assert scenario.paper.domain == "ml_benchmark"
+    assert scenario.lab_constraints.equipment[0].name == "GPU cluster"
+
+
+def test_oracle_score_round_parses_structured_payload() -> None:
+    oracle = Oracle(_FakeClient([_round_score_payload()]))
+    scenario = Scenario.model_validate(_scenario_payload())
+    action = ScientistAction(
+        action_type="request_info",
+        sample_size=0,
+        controls=[],
+        technique="",
+        duration_days=0,
+        required_equipment=[],
+        required_reagents=[],
+        questions=["When is the GPU cluster available?"],
+        rationale="",
+    )
+    lab_manager = LabManagerAgent(_FakeClient([{
+        "response_type": "feasibility_report",
+        "feasible": False,
+        "issues": ["GPU cluster is shared-booked on Monday"],
+        "suggestions": ["Use the single high-memory GPU instead"],
+        "cost_estimate": 250.0,
+        "time_estimate_days": 3,
+        "message": "The GPU cluster is shared-booked Monday; the single high-memory GPU is acceptable with caveats.",
+    }]))
+    response = lab_manager.respond(
+        OracleLabManagerObservation(
+            lab_constraints=scenario.lab_constraints,
+            current_protocol=None,
+            scientist_action=action,
+            round_number=1,
+        )
+    )
+
+    score = oracle.score_round(
+        scenario=scenario,
+        round_number=1,
+        scientist_action=action,
+        lab_manager_response=response,
+        conversation_history=[],
+        current_protocol=None,
+        previous_scores=[],
+    )
+
+    assert isinstance(score, AdjudicatorRoundScore)
+    assert score.step_reward == 0.55
+
+
+def test_oracle_maybe_inject_event_returns_optional_event() -> None:
+    oracle = Oracle(_FakeClient([{"inject": True, "event": {
+        "event_type": "budget_cut",
+        "description": "Finance reduced the remaining budget.",
+        "state_changes": {"lab_constraints.budget_remaining": 1800.0},
+        "severity": "moderate",
+    }}]))
+
+    event = oracle.maybe_inject_event(
+        scenario=Scenario.model_validate(_scenario_payload()),
+        round_number=3,
+        current_protocol=None,
+        conversation_history=[],
+        inject_enabled=True,
+    )
+
+    assert isinstance(event, EnvironmentEvent)
+    assert event.event_type == "budget_cut"
+
+
+def test_oracle_generate_post_mortem_parses_json() -> None:
+    oracle = Oracle(_FakeClient([_post_mortem_payload()]))
+    from replicalab.oracle_models import AdjudicatorTerminalScore
+
+    post_mortem = oracle.generate_post_mortem(
+        scenario=Scenario.model_validate(_scenario_payload()),
+        final_protocol={"technique": "compact_cnn"},
+        conversation_history=[],
+        terminal_score=AdjudicatorTerminalScore(
+            rigor=0.9,
+            feasibility=0.8,
+            fidelity=0.85,
+            parsimony=0.9,
+            robustness=0.8,
+            power_preservation=0.8,
+            efficiency_bonus=0.2,
+            communication_bonus=0.1,
+            penalties={},
+            terminal_reward=5.0,
+            total_reward=5.6,
+        ),
+    )
+
+    assert isinstance(post_mortem, PostMortem)
+    assert "feasible compact CNN plan" in post_mortem.overall_summary
+
+
+def test_env_can_reset_from_oracle_scenario_without_changing_outer_contract() -> None:
+    class _FakeOracle:
+        def __init__(self) -> None:
+            self.scenario = Scenario.model_validate(_scenario_payload())
+
+        def generate_scenario(self, seed: int, difficulty: str, domain: str) -> Scenario:
+            assert seed == 11
+            assert difficulty == "medium"
+            assert domain == "ml_benchmark"
+            return self.scenario
+
+        def score_round(self, **_: object):
+            return AdjudicatorRoundScore.model_validate(_round_score_payload())
+
+        def maybe_inject_event(self, **_: object):
+            return None
+
+        def generate_post_mortem(self, **_: object):
+            return PostMortem.model_validate(_post_mortem_payload())
+
+    env = ReplicaLabEnv(
+        oracle=_FakeOracle(),
+        enable_oracle_post_mortem=True,
+    )
+    observation = env.reset(seed=11, scenario="ml_benchmark", difficulty="medium")
+
+    assert observation.scientist is not None
+    assert observation.scientist.paper_title == "Reproducing a Small Vision Benchmark"
+    assert observation.lab_manager is not None
+    assert "Validation server" in observation.lab_manager.equipment_available
tests/test_prompts.py
CHANGED
@@ -3,6 +3,7 @@
 from __future__ import annotations
 
 from replicalab.prompts import (
+    load_prompt_asset,
     load_prompt_template,
     render_judge_prompt,
     render_lab_manager_prompt,
@@ -22,6 +23,18 @@ def test_load_prompt_template_reads_all_role_files() -> None:
     assert "ReplicaLab" in template
 
 
+def test_load_oracle_prompt_assets_reads_all_oracle_files() -> None:
+    for name in (
+        "oracle_world_architect",
+        "oracle_adjudicator",
+        "oracle_event_injector",
+        "oracle_post_mortem",
+        "oracle_lab_manager",
+    ):
+        template = load_prompt_asset(name)
+        assert len(template) > 100
+
+
 def test_render_scientist_prompt_injects_task_and_bounded_tools() -> None:
     prompt = render_scientist_prompt(_scenario("ml_benchmark"))
tests/test_scenarios.py
CHANGED

@@ -7,7 +7,9 @@ from replicalab.scenarios import (
     NormalizedScenarioPack,
     available_scenario_families,
     generate_scenario,
+    oracle_scenario_to_normalized_pack,
 )
+from replicalab.oracle_models import Scenario as OracleScenario
 
 
 def test_generate_scenario_is_deterministic_for_same_seed() -> None:
@@ -140,3 +142,90 @@ def test_all_domains_produce_bookings_and_windows() -> None:
     pack = generate_scenario(seed=42, template=template, difficulty="medium")
     assert len(pack.resource_bookings) > 0, f"{template} has no bookings"
     assert len(pack.scheduling_windows) > 0, f"{template} has no windows"
+
+
+def test_oracle_scenario_adapter_preserves_domain_and_constraints() -> None:
+    oracle_scenario = OracleScenario.model_validate(
+        {
+            "paper": {
+                "title": "Adapting a benchmark under constraint",
+                "domain": "ml_benchmark",
+                "claim": "A small model remains competitive after budget cuts.",
+                "method_summary": "Train a compact benchmark baseline with fixed controls.",
+                "original_sample_size": 1200,
+                "original_duration_days": 3,
+                "original_technique": "compact_cnn",
+                "required_controls": ["baseline", "seed_control"],
+                "required_equipment": ["GPU cluster"],
+                "required_reagents": ["dataset snapshot"],
+                "statistical_test": "accuracy_gap",
+            },
+            "lab_constraints": {
+                "budget_total": 1800.0,
+                "budget_remaining": 1500.0,
+                "equipment": [
+                    {
+                        "name": "GPU cluster",
+                        "available": True,
+                        "condition": "shared_booking",
+                        "booking_conflicts": ["Monday"],
+                        "cost_per_use": 200.0,
+                    }
+                ],
+                "reagents": [
+                    {
+                        "name": "dataset snapshot",
+                        "in_stock": True,
+                        "quantity_available": 1.0,
+                        "unit": "copy",
+                        "lead_time_days": 0,
+                        "cost": 0.0,
+                    }
+                ],
+                "staff": [
+                    {
+                        "name": "Alex",
+                        "role": "engineer",
+                        "available_days": ["Monday", "Tuesday"],
+                        "skills": ["training", "evaluation"],
+                    }
+                ],
+                "max_duration_days": 5,
+                "safety_rules": ["No external internet during training."],
+                "valid_substitutions": [
+                    {
+                        "original": "GPU cluster",
+                        "substitute": "single high-memory GPU",
+                        "validity": "acceptable_with_caveats",
+                        "caveats": "Longer runtime is acceptable if evaluation fidelity is preserved.",
+                    }
+                ],
+            },
+            "minimum_viable_spec": {
+                "min_sample_size": 800,
+                "must_keep_controls": ["baseline", "seed_control"],
+                "acceptable_techniques": ["compact_cnn"],
+                "min_duration_days": 2,
+                "critical_equipment": ["GPU cluster"],
+                "flexible_equipment": [],
+                "critical_reagents": ["dataset snapshot"],
+                "flexible_reagents": [],
+                "power_threshold": 0.8,
+            },
+            "difficulty": "medium",
+            "narrative_hook": "The preferred GPU window has been partially reallocated.",
+        }
+    )
+
+    pack = oracle_scenario_to_normalized_pack(
+        seed=7,
+        template="ml_benchmark",
+        oracle_scenario=oracle_scenario,
+    )
+
+    assert pack.domain_id == "ml_benchmark"
+    assert pack.scientist_observation.paper_title == "Adapting a benchmark under constraint"
+    assert pack.lab_manager_observation.budget_total == 1800.0
+    assert "GPU cluster" in pack.lab_manager_observation.equipment_booked
+    assert pack.hidden_reference_spec.required_elements
+    assert pack.resource_bookings
tests/test_server.py
CHANGED

@@ -509,7 +509,8 @@ class TestWebSocket:
 
         assert resp["type"] == "step_ok"
         assert resp["done"] is False
-        assert resp["reward"]
+        assert resp["reward"] > 0.0
+        assert resp["info"]["step_reward_components"]["protocol_delta_bonus"] > 0.0
         assert resp["observation"] is not None
 
     def test_ws_full_episode_real_reward(self, client: TestClient) -> None:
@@ -627,8 +628,9 @@ class TestWebSocket:
 
         assert resp["done"] is True
         assert resp["info"]["verdict"] == "timeout"
-        assert resp["reward"]
+        assert resp["reward"] < 0.0
         assert resp["info"]["reward_breakdown"] is not None
+        assert resp["info"]["reward_breakdown"]["penalties"]["timeout"] > 0.0
 
     def test_ws_terminal_episode_persists_real_replay_log(
         self, client: TestClient