Add deterministic judge scoring engine (JDG 01-03)
JDG 01 (rigor): structural completeness + success criteria + required
element coverage. Weighted sub-scores with token matching.
JDG 02 (feasibility): continuous signal from FeasibilityCheckResult
with partial credit for budget, equipment, reagents, and staff.
Does not rescore – wraps AGT 05 to prevent drift.
JDG 03 (fidelity): substitution-aware adherence to hidden reference
spec. Direct match = 1.0, allowed substitution = 0.7, miss = 0.0.
Shared text helpers extracted to replicalab/utils/text.py.
18 new tests covering all 3 scorers across all domains and difficulties.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- docs/map/README.md +15 -8
- docs/map/scoring.md +162 -31
- docs/map/tests.md +5 -5
- replicalab/scoring/__init__.py +11 -0
- replicalab/scoring/feasibility.py +98 -0
- replicalab/scoring/fidelity.py +147 -0
- replicalab/scoring/rigor.py +114 -0
- replicalab/utils/text.py +18 -0
- replicalab/utils/validation.py +2 -19
- tests/test_reward.py +303 -0
docs/map/README.md
CHANGED

````diff
@@ -3,7 +3,7 @@
 > Living reference of every module, class, function, and relationship.
 > Updated after each implementation session.
 >
-> **Last updated:** 2026-03-07
+> **Last updated:** 2026-03-07 (JDG 01-03 scoring implemented)
 
 ## Module Index
 
@@ -13,7 +13,7 @@
 | [scenarios.md](scenarios.md) | Scenario generation – templates, constraints, resources, hidden specs |
 | [agents.md](agents.md) | Agent policies – scientist prompt/parse/retry, lab manager feasibility/suggest/compose |
 | [validation.md](validation.md) | Protocol validation – deterministic checks against scenario constraints |
-| [scoring.md](scoring.md) | Judge scoring – rigor, feasibility, fidelity
+| [scoring.md](scoring.md) | Judge scoring – rigor, feasibility, fidelity |
 | [server.md](server.md) | FastAPI server – REST + WebSocket endpoints, stub environment |
 | [frontend.md](frontend.md) | React UI – dashboard, episode viewer, components |
 | [config.md](config.md) | Shared constants – rounds, budget, timeouts |
@@ -47,10 +47,11 @@ replicalab/utils/validation.py
 ├── replicalab.models (Protocol)
 └── replicalab.scenarios.templates (NormalizedScenarioPack)
 
 replicalab/scoring/
 ├── replicalab.models (Protocol, RewardBreakdown)
-└── replicalab.scenarios (NormalizedScenarioPack, HiddenReferenceSpec)
-
+├── replicalab.scenarios (NormalizedScenarioPack, HiddenReferenceSpec)
+├── replicalab.agents.lab_manager_policy (check_feasibility, FeasibilityCheckResult)
+└── replicalab.utils.text (element_tokens, normalize_label)
 ```
 
 ## File Tree (implemented only)
@@ -70,10 +71,14 @@ replicalab/
 │   ├── math_reasoning.py (2 cases: Cauchy-Schwarz, Jensen's inequality)
 │   ├── ml_benchmark.py (2 cases: AG News TinyBERT, CIFAR-10 ResNet-18)
 │   └── finance_trading.py (2 cases: SPY/QQQ mean-reversion, momentum futures)
 ├── scoring/
-│
+│   ├── __init__.py (exports score_rigor, score_feasibility, score_fidelity)
+│   ├── rigor.py (JDG 01: structural quality + criteria coverage)
+│   ├── feasibility.py (JDG 02: wraps FeasibilityCheckResult with partial credit)
+│   └── fidelity.py (JDG 03: substitution-aware hidden spec alignment)
 ├── utils/
 │   ├── seed.py (deterministic RNG from SHA256)
+│   ├── text.py (shared token matching: normalize_label, element_tokens)
 │   └── validation.py (MOD 05: protocol validation, 5 checks)
 
 server/
@@ -98,7 +103,9 @@ tests/
 ├── test_scenarios.py (8 tests)
 ├── test_validation.py (13 tests)
 ├── test_scientist_policy.py (18 tests)
-└── test_lab_manager_policy.py (13 tests)
+├── test_lab_manager_policy.py (13 tests)
+├── test_reward.py (18 tests – JDG 01-03 scoring)
+└── test_server.py (5 tests – API endpoints)
 ```
 
 ## Task Completion Status
@@ -108,7 +115,7 @@ tests/
 | Models (MOD) | MOD 01-05, 09, 11-12 | MOD 06 | Semantic validators for impossible plans |
 | Scenarios (SCN) | SCN 01-12 | SCN 13 | Booking/scheduling data model |
 | Agents (AGT) | AGT 01-07, 11 | AGT 08-10 | LLM-backed scientist, model selection |
-| Judge (JDG) |
+| Judge (JDG) | JDG 01-03 | JDG 04-08 | Reward composition, bonuses, penalties |
 | Environment (ENV) | – | ENV 01-11 | Entire real environment |
 | Server (API) | API 01-04, 06 (partial) | API 05, 07-10 | Replay, auth, rate limiting |
 | Frontend (FND) | FND 01-10 | – | Complete |
````
docs/map/scoring.md
CHANGED (+162 -31, new version below)

````markdown
# Scoring Map – `replicalab/scoring/`

> Judge scoring engine for protocol evaluation.
> Pure deterministic functions – no LLM calls, no side effects.
>
> **Tasks implemented:** JDG 01, JDG 02, JDG 03
> **Tasks remaining:** JDG 04-08

## Architecture

```
replicalab/scoring/
__init__.py       # exports: score_rigor, score_feasibility, score_fidelity
rigor.py          # JDG 01 – protocol structural quality
feasibility.py    # JDG 02 – resource feasibility (wraps AGT 05)
fidelity.py       # JDG 03 – adherence to hidden reference spec
```

## Shared Utilities

Token matching extracted into `replicalab/utils/text.py`:
- `normalize_label(label) -> str` – lowercase, strip, collapse whitespace
- `element_tokens(element) -> list[str]` – split into searchable tokens (3+ chars)

Used by: `validation.py`, `rigor.py`, `fidelity.py`

---

## JDG 01 – `score_rigor(protocol, scenario) -> float`

**File:** `rigor.py`
**Range:** [0.0, 1.0]
**Measures:** structural completeness and alignment to scenario requirements.

### Weight Breakdown

| Component | Weight | Method |
|-----------|--------|--------|
| Structural completeness | 0.30 | Field population checks |
| Success criteria coverage | 0.40 | Token match vs `scenario.success_criteria` |
| Required element coverage | 0.30 | Token match vs `hidden_reference_spec.required_elements` |

### Structural Completeness (0.30)

Each check contributes equally (0.05 each, total 0.35, then normalized):

| Check | Condition |
|-------|-----------|
| Sample size present | `sample_size >= 1` |
| Sample size meaningful | `sample_size >= 4` |
| Has control | `len(controls) >= 1` |
| Has second control | `len(controls) >= 2` |
| Technique specified | `technique` non-empty |
| Duration allocated | `duration_days >= 1` |
| Substantive rationale | `len(rationale) > 20` |

### Internal Functions

| Function | Purpose |
|----------|---------|
| `_structural_completeness(protocol)` | Field population score |
| `_success_criteria_coverage(protocol, scenario)` | Fraction of criteria matched |
| `_required_element_coverage(protocol, scenario)` | Fraction of elements matched |
| `_protocol_text_blob(protocol)` | Join all text fields for matching |
| `_text_matches(element, blob)` | Token overlap check |

---

## JDG 02 – `score_feasibility(protocol, scenario, check=None) -> float`

**File:** `feasibility.py`
**Range:** [0.0, 1.0]
**Measures:** whether the lab can execute this protocol.

### Key Design: No Rescoring

Does NOT recompute feasibility from scratch. Derives the score from the
`FeasibilityCheckResult` produced by AGT 05's `check_feasibility()`. If no
pre-computed check is passed, calls `check_feasibility()` internally. This
prevents drift between Lab Manager grounding and Judge scoring.

### Weight Breakdown

7 dimensions, each worth 1/7:

| Dimension | Type | Partial Credit Formula |
|-----------|------|------------------------|
| Protocol | Binary | 1.0 if ok, else 0.0 |
| Budget | Continuous | `min(1.0, budget_remaining / estimated_cost)` |
| Equipment | Continuous | fraction of required items that are available |
| Reagents | Continuous | fraction of required items that are in stock |
| Schedule | Binary | 1.0 if ok, else 0.0 |
| Staff | Continuous | `min(1.0, staff_count / required_staff)` |
| Policy | Binary | 1.0 if ok, else 0.0 (hard constraint) |

### Internal Functions

| Function | Purpose |
|----------|---------|
| `_budget_score(check, budget_remaining)` | Continuous budget ratio |
| `_staff_score(check, staff_count)` | Continuous staff ratio |
| `_fraction_score(required, available)` | Generic item-availability fraction |

---

## JDG 03 – `score_fidelity(protocol, scenario) -> float`

**File:** `fidelity.py`
**Range:** [0.0, 1.0]
**Measures:** adherence to `hidden_reference_spec` (which the scientist never sees).

### Weight Breakdown

| Component | Weight | Method |
|-----------|--------|--------|
| Required element coverage | 0.50 | Substitution-aware token match |
| Flexible element alignment | 0.20 | Bonus only, no penalty |
| Target metric alignment | 0.20 | Token match vs metric + value |
| Technique appropriateness | 0.10 | Token match vs spec summary |

### Substitution-Aware Scoring

For required elements:
- **Direct match** (token in protocol text): 1.0 credit
- **Substitution match** (allowed alternative present): 0.7 credit
- **Miss**: 0.0 credit

This is the key difference from JDG 01's element check.

### Internal Functions

| Function | Purpose |
|----------|---------|
| `_required_element_score(elements, text, sub_index)` | Substitution-aware coverage |
| `_flexible_element_score(elements, text)` | Bonus-only coverage |
| `_target_metric_score(metric, value, text)` | Metric + value matching |
| `_technique_score(summary, text)` | Summary alignment |
| `_protocol_text_blob(protocol)` | Join text fields |
| `_text_matches(element, blob)` | Token overlap |
| `_substitution_matches(element, text, sub_index)` | Check alternatives |
| `_build_substitution_index(scenario)` | Map originals → alternatives |

---

## Not Yet Implemented

### `compute_reward(protocol, scenario, ...) -> RewardBreakdown` – JDG 04/05
Combines rigor + feasibility + fidelity with weights.
Applies efficiency bonus (rounds used), communication bonus, and penalties.

### Bonuses & Penalties – JDG 06-08
- `efficiency_bonus`: reward for finishing in fewer rounds
- `communication_bonus`: reward for clear negotiation
- `penalties`: policy violations, hallucinated resources, etc.

## Data Consumed

| Source | Used by | For what |
|--------|---------|----------|
| `Protocol` (models.py) | All 3 scorers | The final agreed protocol |
| `NormalizedScenarioPack` (scenarios) | All 3 scorers | Constraints, resources, criteria |
| `HiddenReferenceSpec` (scenarios) | JDG 01, JDG 03 | Required/flexible elements, target metric |
| `FeasibilityCheckResult` (agents) | JDG 02 | 7 dimension checks with partial credit |
| `AllowedSubstitution` (scenarios) | JDG 03 | Partial credit for substitutions |
| `element_tokens` (utils/text.py) | JDG 01, JDG 03 | Shared token extraction |

## Test Coverage – `tests/test_reward.py`

| Test | What it verifies |
|------|------------------|
| `test_rigor_good_protocol_scores_higher_than_bad` | Quality ordering |
| `test_rigor_is_deterministic` | Same inputs → same output |
| `test_rigor_empty_controls_reduces_score` | Controls matter |
| `test_rigor_short_rationale_reduces_score` | Rationale length matters |
| `test_rigor_all_domains_return_valid_range` | [0,1] across all 9 combinations |
| `test_feasibility_viable_protocol_scores_high` | Good protocol > 0.7 |
| `test_feasibility_infeasible_protocol_scores_lower` | Bad < good |
| `test_feasibility_accepts_precomputed_check` | Pre-computed = computed |
| `test_feasibility_is_deterministic` | Same inputs → same output |
| `test_feasibility_partial_credit_for_near_budget` | Slightly over > far over |
| `test_feasibility_all_domains_return_valid_range` | [0,1] across all 9 combinations |
| `test_fidelity_aligned_protocol_scores_higher` | Aligned > misaligned |
| `test_fidelity_is_deterministic` | Same inputs → same output |
| `test_fidelity_substitution_gets_partial_credit` | Sub > miss |
| `test_fidelity_mentioning_target_metric_improves_score` | Metric mention helps |
| `test_fidelity_all_domains_return_valid_range` | [0,1] across all 9 combinations |
| `test_all_scores_between_zero_and_one_for_bad_protocol` | Bounds check |
| `test_good_protocol_dominates_bad_on_rigor_and_fidelity` | Cross-scorer consistency |
````
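The helpers in `replicalab/utils/text.py` are listed in the change summary (+18 lines) but their source is not reproduced in this diff. A minimal sketch consistent with the documented behavior (lowercase/strip/collapse whitespace; tokens of 3+ characters) might look like:

```python
import re


def normalize_label(label: str) -> str:
    # Lowercase, strip, and collapse internal whitespace runs to single spaces.
    return re.sub(r"\s+", " ", label.strip().lower())


def element_tokens(element: str) -> list[str]:
    # Split a normalized label on non-word characters, keeping tokens of 3+ chars.
    return [t for t in re.split(r"\W+", normalize_label(element)) if len(t) >= 3]


print(normalize_label("  Western   Blot "))  # -> "western blot"
print(element_tokens("anti-GFP antibody"))   # -> ['anti', 'gfp', 'antibody']
```

The 3-character floor keeps short stop-word-like fragments ("a", "of") from producing spurious matches in the token-overlap checks.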
docs/map/tests.md
CHANGED

````diff
@@ -1,6 +1,6 @@
 # Tests Map – `tests/`
 
->
+> 134 tests across 8 files. All passing.
 >
 > **Last verified:** 2026-03-07
 
@@ -12,17 +12,17 @@
 | `test_models.py` | 15 | All Pydantic model contracts |
 | `test_scenarios.py` | 8 | Scenario generation and determinism |
 | `test_validation.py` | 13 | Protocol validation checks |
-| `test_scientist_policy.py` | 18 | Parser, retry, formatter, baseline |
+| `test_scientist_policy.py` | 18+ | Parser, retry, formatter, baseline, bounded tools |
 | `test_lab_manager_policy.py` | 13 | Feasibility, suggestion, response |
-
+| `test_reward.py` | 18 | JDG 01-03 scoring functions |
+| `test_server.py` | 5 | API endpoint integration |
+| **Total** | **134** | |
 
 ## Missing Coverage (not yet implemented)
 
 | File (planned) | Would cover |
 |---------------|-------------|
-| `test_reward.py` | JDG 01-03 scoring functions |
 | `test_env.py` | ENV 01-11 real environment |
-| `test_server.py` | API endpoint integration tests |
 
 ---
 
````
replicalab/scoring/__init__.py
ADDED (+11)

```python
"""Judge scoring engine – deterministic protocol evaluation."""

from .feasibility import score_feasibility
from .fidelity import score_fidelity
from .rigor import score_rigor

__all__ = [
    "score_feasibility",
    "score_fidelity",
    "score_rigor",
]
```
replicalab/scoring/feasibility.py
ADDED (+98)

```python
"""JDG 02 – Protocol feasibility score.

Derives a continuous [0, 1] signal from the existing FeasibilityCheckResult
produced by AGT 05. Does NOT rescore from scratch – this prevents drift
between Lab Manager grounding and Judge scoring.

Pure deterministic function – no LLM calls, no side effects.
"""

from __future__ import annotations

from replicalab.agents.lab_manager_policy import (
    FeasibilityCheckResult,
    check_feasibility,
)
from replicalab.models import Protocol
from replicalab.scenarios.templates import NormalizedScenarioPack


def score_feasibility(
    protocol: Protocol,
    scenario: NormalizedScenarioPack,
    check: FeasibilityCheckResult | None = None,
) -> float:
    """Score resource feasibility as a continuous signal.

    Returns a float in [0.0, 1.0].

    If *check* is not provided, calls ``check_feasibility(protocol, scenario)``
    to compute it. Passing a pre-computed check avoids redundant work.

    Each of the 7 dimensions contributes equally (1/7). Binary dimensions
    score 0 or 1. Continuous dimensions (budget, equipment, reagents, staff)
    give partial credit based on how close the protocol is to passing.
    """
    if check is None:
        check = check_feasibility(protocol, scenario)

    weight = 1.0 / 7.0
    lab = scenario.lab_manager_observation

    scores = [
        # Protocol – binary
        1.0 if check.protocol.ok else 0.0,
        # Budget – partial credit
        _budget_score(check, lab.budget_remaining),
        # Equipment – partial credit
        1.0 if check.equipment.ok else _fraction_score(
            protocol.required_equipment,
            set(lab.equipment_available),
        ),
        # Reagents – partial credit
        1.0 if check.reagents.ok else _fraction_score(
            protocol.required_reagents,
            set(lab.reagents_in_stock),
        ),
        # Schedule – binary
        1.0 if check.schedule.ok else 0.0,
        # Staff – partial credit
        _staff_score(check, lab.staff_count),
        # Policy – binary (hard constraint)
        1.0 if check.policy.ok else 0.0,
    ]

    raw = sum(s * weight for s in scores)
    return max(0.0, min(1.0, raw))


# ---------------------------------------------------------------------------
# Partial credit helpers
# ---------------------------------------------------------------------------


def _budget_score(check: FeasibilityCheckResult, budget_remaining: float) -> float:
    """Continuous budget score: ratio of budget to estimated cost."""
    if check.budget.ok:
        return 1.0
    if check.estimated_cost <= 0:
        return 0.0
    return max(0.0, min(1.0, budget_remaining / check.estimated_cost))


def _staff_score(check: FeasibilityCheckResult, staff_count: int) -> float:
    """Continuous staff score: ratio of available staff to required."""
    if check.staff.ok:
        return 1.0
    if check.required_staff <= 0:
        return 0.0
    return max(0.0, min(1.0, staff_count / check.required_staff))


def _fraction_score(required: list[str], available: set[str]) -> float:
    """Fraction of required items that are in the available set."""
    if not required:
        return 1.0
    available_lower = {item.lower().strip() for item in available}
    matched = sum(1 for item in required if item.lower().strip() in available_lower)
    return matched / len(required)
```
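The budget dimension's partial credit can be illustrated standalone. This sketch mirrors `_budget_score`'s failing branch (budget check not ok, positive estimated cost) with hypothetical numbers:

```python
def budget_partial_credit(budget_remaining: float, estimated_cost: float) -> float:
    # Ratio of remaining budget to estimated cost, clamped to [0, 1].
    if estimated_cost <= 0:
        return 0.0
    return max(0.0, min(1.0, budget_remaining / estimated_cost))


# Slightly over budget earns more credit than far over budget:
print(budget_partial_credit(900.0, 1000.0))  # -> 0.9
print(budget_partial_credit(100.0, 1000.0))  # -> 0.1
```

This is the continuous signal that `test_feasibility_partial_credit_for_near_budget` exercises: a protocol that is nearly affordable scores close to 1.0 instead of flat 0.0.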
replicalab/scoring/fidelity.py
ADDED
|
@@ -0,0 +1,147 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""JDG 03 β Protocol fidelity score.
|
| 2 |
+
|
| 3 |
+
Measures how closely the final protocol matches the hidden reference spec.
|
| 4 |
+
The scientist never sees the hidden spec β this score is for the judge only.
|
| 5 |
+
|
| 6 |
+
Substitution-aware: allowed substitutions get partial credit instead of 0.
|
| 7 |
+
|
| 8 |
+
Pure deterministic function β no LLM calls, no side effects.
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
from __future__ import annotations
|
| 12 |
+
|
| 13 |
+
from replicalab.models import Protocol
|
| 14 |
+
from replicalab.scenarios.templates import NormalizedScenarioPack
|
| 15 |
+
from replicalab.utils.text import element_tokens, normalize_label
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
def score_fidelity(
|
| 19 |
+
protocol: Protocol,
|
| 20 |
+
scenario: NormalizedScenarioPack,
|
| 21 |
+
) -> float:
|
| 22 |
+
"""Score adherence to the hidden reference specification.
|
| 23 |
+
|
| 24 |
+
Returns a float in [0.0, 1.0].
|
| 25 |
+
|
| 26 |
+
Breakdown (weights):
|
| 27 |
+
- Required element coverage: 0.50 (substitution-aware)
|
| 28 |
+
- Flexible element alignment: 0.20 (bonus, no penalty for missing)
|
| 29 |
+
- Target metric alignment: 0.20
|
| 30 |
+
- Technique appropriateness: 0.10
|
| 31 |
+
"""
|
| 32 |
+
spec = scenario.hidden_reference_spec
|
| 33 |
+
protocol_text = _protocol_text_blob(protocol)
|
| 34 |
+
sub_index = _build_substitution_index(scenario)
|
| 35 |
+
|
| 36 |
+
required = _required_element_score(spec.required_elements, protocol_text, sub_index)
|
| 37 |
+
flexible = _flexible_element_score(spec.flexible_elements, protocol_text)
|
| 38 |
+
target = _target_metric_score(spec.target_metric, spec.target_value, protocol_text)
|
| 39 |
+
technique = _technique_score(spec.summary, protocol_text)
|
| 40 |
+
|
| 41 |
+
raw = (0.50 * required) + (0.20 * flexible) + (0.20 * target) + (0.10 * technique)
|
| 42 |
+
return max(0.0, min(1.0, raw))
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
# ---------------------------------------------------------------------------
|
| 46 |
+
# Sub-scores
|
| 47 |
+
# ---------------------------------------------------------------------------
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
def _required_element_score(
|
| 51 |
+
required_elements: list[str],
|
| 52 |
+
protocol_text: str,
|
| 53 |
+
sub_index: dict[str, list[str]],
|
| 54 |
+
) -> float:
|
| 55 |
+
"""Score coverage of required elements, with substitution partial credit.
|
| 56 |
+
|
| 57 |
+
    Direct match: 1.0 credit per element.
    Substitution match: 0.7 credit (allowed substitution present instead).
    Miss: 0.0 credit.
    """
    if not required_elements:
        return 1.0

    total = 0.0
    for element in required_elements:
        if _text_matches(element, protocol_text):
            total += 1.0
        elif _substitution_matches(element, protocol_text, sub_index):
            total += 0.7
        # else: 0.0

    return total / len(required_elements)


def _flexible_element_score(
    flexible_elements: list[str],
    protocol_text: str,
) -> float:
    """Bonus for addressing flexible elements. No penalty for missing."""
    if not flexible_elements:
        return 1.0

    matched = sum(1 for el in flexible_elements if _text_matches(el, protocol_text))
    return matched / len(flexible_elements)


def _target_metric_score(
    target_metric: str,
    target_value: str,
    protocol_text: str,
) -> float:
    """Score for mentioning the target metric and value."""
    score = 0.0
    if _text_matches(target_metric, protocol_text):
        score += 0.5
    if _text_matches(target_value, protocol_text):
        score += 0.5
    return score


def _technique_score(summary: str, protocol_text: str) -> float:
    """Score for technique alignment with the hidden spec summary."""
    return 1.0 if _text_matches(summary, protocol_text) else 0.0


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------


def _protocol_text_blob(protocol: Protocol) -> str:
    """Join all protocol text fields into one lowercase blob for matching."""
    return " ".join([
        protocol.technique,
        protocol.rationale,
        " ".join(protocol.controls),
        " ".join(protocol.required_equipment),
        " ".join(protocol.required_reagents),
    ]).lower()


def _text_matches(element: str, blob: str) -> bool:
    """Check if any token from *element* appears in *blob*."""
    tokens = element_tokens(element)
    return any(token in blob for token in tokens) if tokens else False


def _substitution_matches(
    element: str,
    protocol_text: str,
    sub_index: dict[str, list[str]],
) -> bool:
    """Check if an allowed substitution for *element* appears in the protocol."""
    norm = normalize_label(element)
    alternatives = sub_index.get(norm, [])
    return any(_text_matches(alt, protocol_text) for alt in alternatives)


def _build_substitution_index(
    scenario: NormalizedScenarioPack,
) -> dict[str, list[str]]:
    """Map normalized originals to their alternative labels."""
    index: dict[str, list[str]] = {}
    for sub in scenario.allowed_substitutions:
        key = normalize_label(sub.original)
        index.setdefault(key, []).append(sub.alternative)
    return index
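The credit ladder implemented above can be exercised in isolation. Below is a minimal standalone sketch; the element names, blob, and substitution table are invented for illustration, and the real scorer builds its blob and substitution index from the `Protocol` and scenario pack instead:

```python
def tokens(element: str) -> list[str]:
    # Significant words of 3+ chars, mirroring element_tokens in utils/text.py.
    return [w for w in element.lower().split() if len(w) >= 3]


def matches(element: str, blob: str) -> bool:
    toks = tokens(element)
    return any(t in blob for t in toks) if toks else False


def element_credit(required: list[str], blob: str, sub_index: dict[str, list[str]]) -> float:
    # Direct match = 1.0, allowed substitution = 0.7, miss = 0.0.
    if not required:
        return 1.0
    total = 0.0
    for element in required:
        if matches(element, blob):
            total += 1.0
        elif any(matches(alt, blob) for alt in sub_index.get(" ".join(element.lower().split()), [])):
            total += 0.7
    return total / len(required)


blob = "we benchmark on rented cloud compute with a held-out validation split"
required = ["validation split", "gpu cluster", "confusion matrix"]
subs = {"gpu cluster": ["cloud compute"]}  # hypothetical allowed substitution

print(round(element_credit(required, blob, subs), 3))  # (1.0 + 0.7 + 0.0) / 3 = 0.567
```

The direct hit, the substitution hit, and the miss average to 1.7 / 3, matching the per-element weights documented in the docstring.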
replicalab/scoring/rigor.py
ADDED
@@ -0,0 +1,114 @@
"""JDG 01 – Protocol rigor score.

Measures structural completeness and alignment to scenario requirements.
Pure deterministic function – no LLM calls, no side effects.
"""

from __future__ import annotations

from replicalab.models import Protocol
from replicalab.scenarios.templates import NormalizedScenarioPack
from replicalab.utils.text import element_tokens


def score_rigor(protocol: Protocol, scenario: NormalizedScenarioPack) -> float:
    """Score protocol structural quality and requirement coverage.

    Returns a float in [0.0, 1.0].

    Breakdown (weights):
    - Structural completeness: 0.30
    - Success criteria coverage: 0.40
    - Required element coverage: 0.30
    """
    completeness = _structural_completeness(protocol)
    criteria = _success_criteria_coverage(protocol, scenario)
    elements = _required_element_coverage(protocol, scenario)

    raw = (0.30 * completeness) + (0.40 * criteria) + (0.30 * elements)
    return max(0.0, min(1.0, raw))


# ---------------------------------------------------------------------------
# Sub-scores
# ---------------------------------------------------------------------------


def _structural_completeness(protocol: Protocol) -> float:
    """Score based on whether protocol fields are meaningfully populated."""
    score = 0.0
    total = 0.35

    # sample_size present
    if protocol.sample_size >= 1:
        score += 0.05
    # sample_size statistically meaningful
    if protocol.sample_size >= 4:
        score += 0.05
    # at least one control
    if len(protocol.controls) >= 1:
        score += 0.05
    # second control (stronger design)
    if len(protocol.controls) >= 2:
        score += 0.05
    # technique specified
    if protocol.technique:
        score += 0.05
    # duration allocated
    if protocol.duration_days >= 1:
        score += 0.05
    # rationale is substantive (not a placeholder)
    if len(protocol.rationale) > 20:
        score += 0.05

    return score / total


def _success_criteria_coverage(
    protocol: Protocol,
    scenario: NormalizedScenarioPack,
) -> float:
    """Fraction of scenario success criteria addressed by protocol text."""
    criteria = scenario.success_criteria
    if not criteria:
        return 1.0

    protocol_text = _protocol_text_blob(protocol)
    matched = sum(1 for c in criteria if _text_matches(c, protocol_text))
    return matched / len(criteria)


def _required_element_coverage(
    protocol: Protocol,
    scenario: NormalizedScenarioPack,
) -> float:
    """Fraction of hidden reference required elements addressed."""
    required = scenario.hidden_reference_spec.required_elements
    if not required:
        return 1.0

    protocol_text = _protocol_text_blob(protocol)
    matched = sum(1 for el in required if _text_matches(el, protocol_text))
    return matched / len(required)


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------


def _protocol_text_blob(protocol: Protocol) -> str:
    """Join all protocol text fields into one lowercase blob for matching."""
    return " ".join([
        protocol.technique,
        protocol.rationale,
        " ".join(protocol.controls),
        " ".join(protocol.required_equipment),
        " ".join(protocol.required_reagents),
    ]).lower()


def _text_matches(element: str, blob: str) -> bool:
    """Check if any token from *element* appears in *blob*."""
    tokens = element_tokens(element)
    return any(token in blob for token in tokens) if tokens else False
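The 0.30 / 0.40 / 0.30 blend and clamp in `score_rigor` are easy to sanity-check on their own. A standalone sketch of just that combination (the sub-score values fed in here are invented):

```python
def blend_rigor(completeness: float, criteria: float, elements: float) -> float:
    # Weighted blend with the same weights as score_rigor, clamped to [0, 1].
    raw = (0.30 * completeness) + (0.40 * criteria) + (0.30 * elements)
    return max(0.0, min(1.0, raw))


# Invented sub-score values for illustration:
print(blend_rigor(1.0, 1.0, 1.0))   # perfect sub-scores -> top of the range
print(blend_rigor(1.0, 0.5, 0.0))   # 0.30 + 0.20
```

Because the weights sum to 1.0 and each sub-score is already normalized to [0, 1], the clamp only matters as a guard against floating-point drift or out-of-range inputs.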
replicalab/utils/text.py
ADDED
@@ -0,0 +1,18 @@
"""Shared text-matching helpers used by validation and scoring modules."""

from __future__ import annotations


def normalize_label(label: str) -> str:
    """Lowercase, strip, collapse whitespace for fuzzy matching."""
    return " ".join(label.lower().split())


def element_tokens(element: str) -> list[str]:
    """Split a required-element description into searchable tokens.

    Returns individual significant words (>= 3 chars) so that
    "transaction cost assumption" matches a rationale containing
    "transaction" or "cost".
    """
    return [word for word in element.lower().split() if len(word) >= 3]
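Since both helpers are pure string functions, their behavior is easy to verify directly. Restated standalone for a quick check:

```python
def normalize_label(label: str) -> str:
    """Lowercase, strip, collapse whitespace for fuzzy matching."""
    return " ".join(label.lower().split())


def element_tokens(element: str) -> list[str]:
    """Split a description into significant words (>= 3 chars)."""
    return [word for word in element.lower().split() if len(word) >= 3]


print(normalize_label("  Transaction   Cost  "))
# -> "transaction cost"
print(element_tokens("transaction cost assumption of 5 bp"))
# -> ['transaction', 'cost', 'assumption']  ("of", "5", "bp" are dropped)
```

The 3-character cutoff is what keeps stopwords and stray numerals out of the token set that `_text_matches` searches for.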
replicalab/utils/validation.py
CHANGED
@@ -16,6 +16,8 @@ from pydantic import BaseModel, ConfigDict, Field
 
 from replicalab.models import Protocol
 from replicalab.scenarios.templates import NormalizedScenarioPack
+from replicalab.utils.text import element_tokens as _element_tokens
+from replicalab.utils.text import normalize_label as _normalize
 
 
 # ---------------------------------------------------------------------------
@@ -264,25 +266,6 @@ def _check_required_element_coverage(
 # ---------------------------------------------------------------------------
 
 
-def _normalize(label: str) -> str:
-    """Lowercase, strip, collapse whitespace for fuzzy matching."""
-    return " ".join(label.lower().split())
-
-
-def _element_tokens(element: str) -> list[str]:
-    """Split a required-element description into searchable tokens.
-
-    Returns individual significant words (>= 3 chars) so that
-    "transaction cost assumption" matches a rationale containing
-    "transaction" or "cost".
-    """
-    return [
-        word
-        for word in element.lower().split()
-        if len(word) >= 3
-    ]
-
-
 def _substitution_alternatives(scenario: NormalizedScenarioPack) -> set[str]:
     """Return the normalized 'original' values from allowed substitutions."""
     return {
tests/test_reward.py
ADDED
@@ -0,0 +1,303 @@
"""Tests for JDG 01–03 scoring functions."""

from __future__ import annotations

from replicalab.agents.lab_manager_policy import check_feasibility
from replicalab.models import Protocol
from replicalab.scenarios import generate_scenario
from replicalab.scoring import score_feasibility, score_fidelity, score_rigor


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------


def _scenario(template: str = "ml_benchmark", difficulty: str = "easy"):
    return generate_scenario(seed=42, template=template, difficulty=difficulty)


def _good_protocol(scenario) -> Protocol:
    """Build a well-formed protocol aligned to the scenario."""
    lab = scenario.lab_manager_observation
    spec = scenario.hidden_reference_spec
    return Protocol(
        sample_size=10,
        controls=["baseline", "ablation"],
        technique=spec.summary[:60] if spec.summary else "replication_plan",
        duration_days=max(1, min(2, lab.time_limit_days)),
        required_equipment=(
            list(lab.equipment_available[:1])
            if lab.equipment_available
            else []
        ),
        required_reagents=(
            list(lab.reagents_in_stock[:1])
            if lab.reagents_in_stock
            else []
        ),
        rationale=(
            f"Plan addresses: {', '.join(spec.required_elements[:2])}. "
            f"Target metric: {spec.target_metric}. "
            f"Target value: {spec.target_value}. "
            "Stay within budget and schedule."
        ),
    )


def _bad_protocol() -> Protocol:
    """Build a minimal protocol that misses most requirements."""
    return Protocol(
        sample_size=1,
        controls=[],
        technique="unknown_method",
        duration_days=1,
        required_equipment=[],
        required_reagents=[],
        rationale="No plan.",
    )


# ---------------------------------------------------------------------------
# JDG 01 – score_rigor
# ---------------------------------------------------------------------------


def test_rigor_good_protocol_scores_higher_than_bad() -> None:
    scenario = _scenario("ml_benchmark", "easy")
    good = _good_protocol(scenario)
    bad = _bad_protocol()

    good_score = score_rigor(good, scenario)
    bad_score = score_rigor(bad, scenario)

    assert good_score > bad_score
    assert 0.0 <= good_score <= 1.0
    assert 0.0 <= bad_score <= 1.0


def test_rigor_is_deterministic() -> None:
    scenario = _scenario("ml_benchmark", "medium")
    protocol = _good_protocol(scenario)

    first = score_rigor(protocol, scenario)
    second = score_rigor(protocol, scenario)

    assert first == second


def test_rigor_empty_controls_reduces_score() -> None:
    scenario = _scenario("math_reasoning", "easy")
    with_controls = _good_protocol(scenario)
    without_controls = with_controls.model_copy(update={"controls": ["only_one"]})

    score_with = score_rigor(with_controls, scenario)
    score_without = score_rigor(without_controls, scenario)

    assert score_with >= score_without


def test_rigor_short_rationale_reduces_score() -> None:
    scenario = _scenario("finance_trading", "easy")
    good = _good_protocol(scenario)
    short = good.model_copy(update={"rationale": "OK."})

    assert score_rigor(good, scenario) > score_rigor(short, scenario)


def test_rigor_all_domains_return_valid_range() -> None:
    for template in ("ml_benchmark", "math_reasoning", "finance_trading"):
        for difficulty in ("easy", "medium", "hard"):
            scenario = generate_scenario(seed=99, template=template, difficulty=difficulty)
            protocol = _good_protocol(scenario)
            score = score_rigor(protocol, scenario)
            assert 0.0 <= score <= 1.0, f"{template}/{difficulty}: {score}"


# ---------------------------------------------------------------------------
# JDG 02 – score_feasibility
# ---------------------------------------------------------------------------


def test_feasibility_viable_protocol_scores_high() -> None:
    scenario = _scenario("ml_benchmark", "easy")
    protocol = _good_protocol(scenario)

    score = score_feasibility(protocol, scenario)

    assert score > 0.7
    assert 0.0 <= score <= 1.0


def test_feasibility_infeasible_protocol_scores_lower() -> None:
    scenario = _scenario("ml_benchmark", "easy")
    good = _good_protocol(scenario)
    # Blow the budget and schedule
    bad = good.model_copy(update={
        "sample_size": 200,
        "duration_days": scenario.lab_manager_observation.time_limit_days + 5,
        "required_equipment": ["Imaginary Device"],
    })

    good_score = score_feasibility(good, scenario)
    bad_score = score_feasibility(bad, scenario)

    assert good_score > bad_score


def test_feasibility_accepts_precomputed_check() -> None:
    scenario = _scenario("finance_trading", "easy")
    protocol = _good_protocol(scenario)
    check = check_feasibility(protocol, scenario)

    score_with = score_feasibility(protocol, scenario, check=check)
    score_without = score_feasibility(protocol, scenario)

    assert score_with == score_without


def test_feasibility_is_deterministic() -> None:
    scenario = _scenario("math_reasoning", "medium")
    protocol = _good_protocol(scenario)

    first = score_feasibility(protocol, scenario)
    second = score_feasibility(protocol, scenario)

    assert first == second


def test_feasibility_partial_credit_for_near_budget() -> None:
    """A protocol slightly over budget should score higher than one far over."""
    scenario = _scenario("ml_benchmark", "easy")
    good = _good_protocol(scenario)

    slightly_over = good.model_copy(update={"sample_size": 40})
    far_over = good.model_copy(update={"sample_size": 200})

    score_slight = score_feasibility(slightly_over, scenario)
    score_far = score_feasibility(far_over, scenario)

    assert score_slight >= score_far


def test_feasibility_all_domains_return_valid_range() -> None:
    for template in ("ml_benchmark", "math_reasoning", "finance_trading"):
        for difficulty in ("easy", "medium", "hard"):
            scenario = generate_scenario(seed=99, template=template, difficulty=difficulty)
            protocol = _good_protocol(scenario)
            score = score_feasibility(protocol, scenario)
            assert 0.0 <= score <= 1.0, f"{template}/{difficulty}: {score}"


# ---------------------------------------------------------------------------
# JDG 03 – score_fidelity
# ---------------------------------------------------------------------------


def test_fidelity_aligned_protocol_scores_higher() -> None:
    scenario = _scenario("ml_benchmark", "easy")
    aligned = _good_protocol(scenario)
    misaligned = _bad_protocol()

    aligned_score = score_fidelity(aligned, scenario)
    misaligned_score = score_fidelity(misaligned, scenario)

    assert aligned_score > misaligned_score
    assert 0.0 <= aligned_score <= 1.0
    assert 0.0 <= misaligned_score <= 1.0


def test_fidelity_is_deterministic() -> None:
    scenario = _scenario("finance_trading", "hard")
    protocol = _good_protocol(scenario)

    first = score_fidelity(protocol, scenario)
    second = score_fidelity(protocol, scenario)

    assert first == second


def test_fidelity_substitution_gets_partial_credit() -> None:
    """Using an allowed substitution should score better than a total miss."""
    scenario = _scenario("math_reasoning", "easy")
    spec = scenario.hidden_reference_spec

    # Find a required element that has a substitution
    sub_map = {}
    for sub in scenario.allowed_substitutions:
        sub_map[sub.original.lower()] = sub.alternative

    if not sub_map or not spec.required_elements:
        return  # skip if no substitution exists in this scenario

    # Build protocol that uses the substitution alternative
    first_sub_original = list(sub_map.keys())[0]
    first_sub_alt = sub_map[first_sub_original]

    with_sub = _good_protocol(scenario).model_copy(update={
        "rationale": f"We will use {first_sub_alt} instead. " + spec.target_metric,
    })
    without_anything = _bad_protocol()

    score_sub = score_fidelity(with_sub, scenario)
    score_miss = score_fidelity(without_anything, scenario)

    assert score_sub > score_miss


def test_fidelity_mentioning_target_metric_improves_score() -> None:
    scenario = _scenario("ml_benchmark", "easy")
    spec = scenario.hidden_reference_spec

    with_metric = _good_protocol(scenario)
    without_metric = with_metric.model_copy(update={
        "rationale": "Generic plan without any specific metric mentioned.",
    })

    score_with = score_fidelity(with_metric, scenario)
    score_without = score_fidelity(without_metric, scenario)

    assert score_with >= score_without


def test_fidelity_all_domains_return_valid_range() -> None:
    for template in ("ml_benchmark", "math_reasoning", "finance_trading"):
        for difficulty in ("easy", "medium", "hard"):
            scenario = generate_scenario(seed=99, template=template, difficulty=difficulty)
            protocol = _good_protocol(scenario)
            score = score_fidelity(protocol, scenario)
            assert 0.0 <= score <= 1.0, f"{template}/{difficulty}: {score}"


# ---------------------------------------------------------------------------
# Cross-scorer consistency
# ---------------------------------------------------------------------------


def test_all_scores_between_zero_and_one_for_bad_protocol() -> None:
    for template in ("ml_benchmark", "math_reasoning", "finance_trading"):
        scenario = generate_scenario(seed=7, template=template, difficulty="hard")
        bad = _bad_protocol()

        r = score_rigor(bad, scenario)
        fe = score_feasibility(bad, scenario)
        fi = score_fidelity(bad, scenario)

        assert 0.0 <= r <= 1.0, f"rigor {template}: {r}"
        assert 0.0 <= fe <= 1.0, f"feasibility {template}: {fe}"
        assert 0.0 <= fi <= 1.0, f"fidelity {template}: {fi}"


def test_good_protocol_dominates_bad_on_rigor_and_fidelity() -> None:
    """Good protocol beats bad on rigor and fidelity.

    Feasibility is excluded: a protocol that asks for nothing is trivially
    feasible (no equipment, no reagents – nothing can fail). The other two
    scores correctly penalize an empty plan.
    """
    scenario = _scenario("ml_benchmark", "easy")
    good = _good_protocol(scenario)
    bad = _bad_protocol()

    assert score_rigor(good, scenario) > score_rigor(bad, scenario)
    assert score_fidelity(good, scenario) > score_fidelity(bad, scenario)