ayushozha Claude Opus 4.6 committed on
Commit e50dca9 · 1 Parent(s): 0ed9084

Add deterministic judge scoring engine (JDG 01-03)


JDG 01 (rigor): structural completeness + success criteria + required
element coverage. Weighted sub-scores with token matching.

JDG 02 (feasibility): continuous signal from FeasibilityCheckResult
with partial credit for budget, equipment, reagents, and staff.
Does not rescore — wraps AGT 05 to prevent drift.

JDG 03 (fidelity): substitution-aware adherence to hidden reference
spec. Direct match = 1.0, allowed substitution = 0.7, miss = 0.0.

Shared text helpers extracted to replicalab/utils/text.py.
18 new tests covering all 3 scorers across all domains and difficulties.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs/map/README.md CHANGED
@@ -3,7 +3,7 @@
 > Living reference of every module, class, function, and relationship.
 > Updated after each implementation session.
 >
-> **Last updated:** 2026-03-07
+> **Last updated:** 2026-03-07 (JDG 01-03 scoring implemented)

 ## Module Index

@@ -13,7 +13,7 @@
 | [scenarios.md](scenarios.md) | Scenario generation — templates, constraints, resources, hidden specs |
 | [agents.md](agents.md) | Agent policies — scientist prompt/parse/retry, lab manager feasibility/suggest/compose |
 | [validation.md](validation.md) | Protocol validation — deterministic checks against scenario constraints |
-| [scoring.md](scoring.md) | Judge scoring — rigor, feasibility, fidelity (NOT YET IMPLEMENTED) |
+| [scoring.md](scoring.md) | Judge scoring — rigor, feasibility, fidelity |
 | [server.md](server.md) | FastAPI server — REST + WebSocket endpoints, stub environment |
 | [frontend.md](frontend.md) | React UI — dashboard, episode viewer, components |
 | [config.md](config.md) | Shared constants — rounds, budget, timeouts |
@@ -47,10 +47,11 @@ replicalab/utils/validation.py
 ├── replicalab.models (Protocol)
 └── replicalab.scenarios.templates (NormalizedScenarioPack)

-replicalab/scoring/   <-- NOT YET IMPLEMENTED
+replicalab/scoring/
 ├── replicalab.models (Protocol, RewardBreakdown)
 ├── replicalab.scenarios (NormalizedScenarioPack, HiddenReferenceSpec)
-└── replicalab.agents.lab_manager_policy (check_feasibility, FeasibilityCheckResult)
+├── replicalab.agents.lab_manager_policy (check_feasibility, FeasibilityCheckResult)
+└── replicalab.utils.text (element_tokens, normalize_label)
 ```

 ## File Tree (implemented only)
@@ -70,10 +71,14 @@ replicalab/
 │ ├── math_reasoning.py (2 cases: Cauchy-Schwarz, Jensen's inequality)
 │ ├── ml_benchmark.py (2 cases: AG News TinyBERT, CIFAR-10 ResNet-18)
 │ └── finance_trading.py (2 cases: SPY/QQQ mean-reversion, momentum futures)
-├── scoring/   <-- EMPTY (JDG 01-03 not yet built)
-│ └── .gitkeep
+├── scoring/
+│ ├── __init__.py (exports score_rigor, score_feasibility, score_fidelity)
+│ ├── rigor.py (JDG 01: structural quality + criteria coverage)
+│ ├── feasibility.py (JDG 02: wraps FeasibilityCheckResult with partial credit)
+│ └── fidelity.py (JDG 03: substitution-aware hidden spec alignment)
 └── utils/
 ├── seed.py (deterministic RNG from SHA256)
+├── text.py (shared token matching: normalize_label, element_tokens)
 └── validation.py (MOD 05: protocol validation, 5 checks)

 server/
@@ -98,7 +103,9 @@ tests/
 ├── test_scenarios.py (8 tests)
 ├── test_validation.py (13 tests)
 ├── test_scientist_policy.py (18 tests)
-└── test_lab_manager_policy.py (13 tests)
+├── test_lab_manager_policy.py (13 tests)
+├── test_reward.py (18 tests — JDG 01-03 scoring)
+└── test_server.py (5 tests — API endpoints)
 ```

 ## Task Completion Status
@@ -108,7 +115,7 @@ tests/
 | Models (MOD) | MOD 01-05, 09, 11-12 | MOD 06 | Semantic validators for impossible plans |
 | Scenarios (SCN) | SCN 01-12 | SCN 13 | Booking/scheduling data model |
 | Agents (AGT) | AGT 01-07, 11 | AGT 08-10 | LLM-backed scientist, model selection |
-| Judge (JDG) | — | JDG 01-08 | Entire scoring engine |
+| Judge (JDG) | JDG 01-03 | JDG 04-08 | Reward composition, bonuses, penalties |
 | Environment (ENV) | — | ENV 01-11 | Entire real environment |
 | Server (API) | API 01-04, 06 (partial) | API 05, 07-10 | Replay, auth, rate limiting |
 | Frontend (FND) | FND 01-10 | — | Complete |
docs/map/scoring.md CHANGED
@@ -1,57 +1,188 @@
 # Scoring Map — `replicalab/scoring/`

 > Judge scoring engine for protocol evaluation.
+> Pure deterministic functions — no LLM calls, no side effects.
 >
-> **Status:** NOT YET IMPLEMENTED
-> **Tasks remaining:** JDG 01-08
+> **Tasks implemented:** JDG 01, JDG 02, JDG 03
+> **Tasks remaining:** JDG 04-08

-## Planned Architecture
+## Architecture

 ```
 replicalab/scoring/
-__init__.py      # exports: score_rigor, score_feasibility, score_fidelity, compute_reward
+__init__.py      # exports: score_rigor, score_feasibility, score_fidelity
 rigor.py         # JDG 01 — protocol structural quality
 feasibility.py   # JDG 02 — resource feasibility (wraps AGT 05)
 fidelity.py      # JDG 03 — adherence to hidden reference spec
 ```

-## Planned Functions
-
-### `score_rigor(protocol, scenario) -> float` — JDG 01
-Score range: [0.0, 1.0]
-Measures: structural completeness, success criteria coverage, required element coverage.
-
-### `score_feasibility(protocol, scenario, check_result=None) -> float` — JDG 02
-Score range: [0.0, 1.0]
-Measures: whether the lab can execute the protocol (budget, equipment, reagents, schedule, staff, policy).
-Reuses `check_feasibility()` from AGT 05. Adds partial credit (continuous signal vs binary pass/fail).
-
-### `score_fidelity(protocol, scenario) -> float` — JDG 03
-Score range: [0.0, 1.0]
-Measures: how closely the protocol matches `hidden_reference_spec`.
-Substitution-aware — allowed substitutions get partial credit.
-
-### `compute_reward(protocol, scenario, check_result=None) -> RewardBreakdown` — JDG 04/05
-Combines rigor + feasibility + fidelity into `RewardBreakdown`.
-Applies efficiency bonus, communication bonus, and penalties.
+## Shared Utilities
+
+Token matching extracted into `replicalab/utils/text.py`:
+- `normalize_label(label) -> str` — lowercase, strip, collapse whitespace
+- `element_tokens(element) -> list[str]` — split into searchable tokens (3+ chars)
+
+Used by: `validation.py`, `rigor.py`, `fidelity.py`
+
+---
+
+## JDG 01 — `score_rigor(protocol, scenario) -> float`
+
+**File:** `rigor.py`
+**Range:** [0.0, 1.0]
+**Measures:** structural completeness and alignment to scenario requirements.
+
+### Weight Breakdown
+
+| Component | Weight | Method |
+|-----------|--------|--------|
+| Structural completeness | 0.30 | Field population checks |
+| Success criteria coverage | 0.40 | Token match vs `scenario.success_criteria` |
+| Required element coverage | 0.30 | Token match vs `hidden_reference_spec.required_elements` |
+
+### Structural Completeness (0.30)
+
+Each check contributes equally (0.05 each, total 0.35, then normalized):
+
+| Check | Condition |
+|-------|-----------|
+| Sample size present | `sample_size >= 1` |
+| Sample size meaningful | `sample_size >= 4` |
+| Has control | `len(controls) >= 1` |
+| Has second control | `len(controls) >= 2` |
+| Technique specified | `technique` non-empty |
+| Duration allocated | `duration_days >= 1` |
+| Substantive rationale | `len(rationale) > 20` |
+
+### Internal Functions
+
+| Function | Purpose |
+|----------|---------|
+| `_structural_completeness(protocol)` | Field population score |
+| `_success_criteria_coverage(protocol, scenario)` | Fraction of criteria matched |
+| `_required_element_coverage(protocol, scenario)` | Fraction of elements matched |
+| `_protocol_text_blob(protocol)` | Join all text fields for matching |
+| `_text_matches(element, blob)` | Token overlap check |
+
+---
+
+## JDG 02 — `score_feasibility(protocol, scenario, check=None) -> float`
+
+**File:** `feasibility.py`
+**Range:** [0.0, 1.0]
+**Measures:** whether the lab can execute this protocol.
+
+### Key Design: No Rescoring
+
+Does NOT recompute feasibility from scratch. Derives score from `FeasibilityCheckResult`
+produced by AGT 05's `check_feasibility()`. If no pre-computed check is passed, calls
+`check_feasibility()` internally. This prevents drift between Lab Manager grounding
+and Judge scoring.
+
+### Weight Breakdown
+
+7 dimensions, each worth 1/7:
+
+| Dimension | Type | Partial Credit Formula |
+|-----------|------|------------------------|
+| Protocol | Binary | 1.0 if ok, else 0.0 |
+| Budget | Continuous | `min(1.0, budget_remaining / estimated_cost)` |
+| Equipment | Continuous | fraction of required items that are available |
+| Reagents | Continuous | fraction of required items that are in stock |
+| Schedule | Binary | 1.0 if ok, else 0.0 |
+| Staff | Continuous | `min(1.0, staff_count / required_staff)` |
+| Policy | Binary | 1.0 if ok, else 0.0 (hard constraint) |
+
+### Internal Functions
+
+| Function | Purpose |
+|----------|---------|
+| `_budget_score(check, budget_remaining)` | Continuous budget ratio |
+| `_staff_score(check, staff_count)` | Continuous staff ratio |
+| `_fraction_score(required, available)` | Generic item-availability fraction |
+
+---
+
+## JDG 03 — `score_fidelity(protocol, scenario) -> float`
+
+**File:** `fidelity.py`
+**Range:** [0.0, 1.0]
+**Measures:** adherence to `hidden_reference_spec` (which the scientist never sees).
+
+### Weight Breakdown
+
+| Component | Weight | Method |
+|-----------|--------|--------|
+| Required element coverage | 0.50 | Substitution-aware token match |
+| Flexible element alignment | 0.20 | Bonus only, no penalty |
+| Target metric alignment | 0.20 | Token match vs metric + value |
+| Technique appropriateness | 0.10 | Token match vs spec summary |
+
+### Substitution-Aware Scoring
+
+For required elements:
+- **Direct match** (token in protocol text): 1.0 credit
+- **Substitution match** (allowed alternative present): 0.7 credit
+- **Miss**: 0.0 credit
+
+This is the key difference from JDG 01's element check.
+
+### Internal Functions
+
+| Function | Purpose |
+|----------|---------|
+| `_required_element_score(elements, text, sub_index)` | Substitution-aware coverage |
+| `_flexible_element_score(elements, text)` | Bonus-only coverage |
+| `_target_metric_score(metric, value, text)` | Metric + value matching |
+| `_technique_score(summary, text)` | Summary alignment |
+| `_protocol_text_blob(protocol)` | Join text fields |
+| `_text_matches(element, blob)` | Token overlap |
+| `_substitution_matches(element, text, sub_index)` | Check alternatives |
+| `_build_substitution_index(scenario)` | Map originals → alternatives |
+
+---
+
+## Not Yet Implemented
+
+### `compute_reward(protocol, scenario, ...) -> RewardBreakdown` — JDG 04/05
+Combines rigor + feasibility + fidelity with weights.
+Applies efficiency bonus (rounds used), communication bonus, and penalties.
+
+### Bonuses & Penalties — JDG 06-08
+- `efficiency_bonus`: reward for finishing in fewer rounds
+- `communication_bonus`: reward for clear negotiation
+- `penalties`: policy violations, hallucinated resources, etc.

 ## Data Consumed

 | Source | Used by | For what |
 |--------|---------|----------|
-| `Protocol` (models.py) | All scorers | The final agreed protocol |
-| `NormalizedScenarioPack` (scenarios) | All scorers | Constraints, resources, success criteria |
+| `Protocol` (models.py) | All 3 scorers | The final agreed protocol |
+| `NormalizedScenarioPack` (scenarios) | All 3 scorers | Constraints, resources, criteria |
 | `HiddenReferenceSpec` (scenarios) | JDG 01, JDG 03 | Required/flexible elements, target metric |
-| `FeasibilityCheckResult` (agents) | JDG 02 | 7 dimension checks |
+| `FeasibilityCheckResult` (agents) | JDG 02 | 7 dimension checks with partial credit |
 | `AllowedSubstitution` (scenarios) | JDG 03 | Partial credit for substitutions |
-| `RewardBreakdown` (models.py) | JDG 04/05 | Output container |
+| `element_tokens` (utils/text.py) | JDG 01, JDG 03 | Shared token extraction |

-## Data Produced
-
-`RewardBreakdown` populates:
-- `rigor: float` — from JDG 01
-- `feasibility: float` — from JDG 02
-- `fidelity: float` — from JDG 03
-- `efficiency_bonus: float` — from JDG 04 (rounds used / max rounds)
-- `communication_bonus: float` — from JDG 05 (negotiation quality)
-- `penalties: dict[str, float]` — from JDG 06-08 (policy violations, etc.)
+## Test Coverage — `tests/test_reward.py`
+
+| Test | What it verifies |
+|------|------------------|
+| `test_rigor_good_protocol_scores_higher_than_bad` | Quality ordering |
+| `test_rigor_is_deterministic` | Same inputs → same output |
+| `test_rigor_empty_controls_reduces_score` | Controls matter |
+| `test_rigor_short_rationale_reduces_score` | Rationale length matters |
+| `test_rigor_all_domains_return_valid_range` | [0,1] across all 9 combinations |
+| `test_feasibility_viable_protocol_scores_high` | Good protocol > 0.7 |
+| `test_feasibility_infeasible_protocol_scores_lower` | Bad < good |
+| `test_feasibility_accepts_precomputed_check` | Pre-computed = computed |
+| `test_feasibility_is_deterministic` | Same inputs → same output |
+| `test_feasibility_partial_credit_for_near_budget` | Slightly over > far over |
+| `test_feasibility_all_domains_return_valid_range` | [0,1] across all 9 combinations |
+| `test_fidelity_aligned_protocol_scores_higher` | Aligned > misaligned |
+| `test_fidelity_is_deterministic` | Same inputs → same output |
+| `test_fidelity_substitution_gets_partial_credit` | Sub > miss |
+| `test_fidelity_mentioning_target_metric_improves_score` | Metric mention helps |
+| `test_fidelity_all_domains_return_valid_range` | [0,1] across all 9 combinations |
+| `test_all_scores_between_zero_and_one_for_bad_protocol` | Bounds check |
+| `test_good_protocol_dominates_bad_on_rigor_and_fidelity` | Cross-scorer consistency |
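The substitution-aware credit scheme documented above (direct match = 1.0, allowed substitution = 0.7, miss = 0.0, averaged over required elements) can be sketched standalone. This is an illustrative re-implementation using plain substring matching instead of the repo's token helpers; the function name and inputs are hypothetical, not the module's API:

```python
# Sketch of JDG 03's required-element credit: direct = 1.0,
# allowed substitution = 0.7, miss = 0.0, averaged over elements.
def required_element_credit(
    required: list[str],
    protocol_text: str,
    substitutions: dict[str, list[str]],  # original -> alternatives
) -> float:
    if not required:
        return 1.0  # vacuously covered
    text = protocol_text.lower()
    total = 0.0
    for element in required:
        if element.lower() in text:
            total += 1.0  # direct match
        elif any(alt.lower() in text
                 for alt in substitutions.get(element.lower(), [])):
            total += 0.7  # allowed substitution present instead
    return total / len(required)

# Two required elements: one direct hit, one covered by a substitution.
score = required_element_credit(
    ["western blot", "loading control"],
    "we run a dot blot with a loading control lane",
    {"western blot": ["dot blot"]},
)
print(score)  # 0.85
```

With one direct hit and one substitution over two required elements, the credit is (1.0 + 0.7) / 2 = 0.85, matching the test `test_fidelity_substitution_gets_partial_credit`'s "sub > miss" ordering.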
docs/map/tests.md CHANGED
@@ -1,6 +1,6 @@
 # Tests Map — `tests/`

-> 87 tests across 6 files. All passing.
+> 134 tests across 8 files. All passing.
 >
 > **Last verified:** 2026-03-07

@@ -12,17 +12,17 @@
 | `test_models.py` | 15 | All Pydantic model contracts |
 | `test_scenarios.py` | 8 | Scenario generation and determinism |
 | `test_validation.py` | 13 | Protocol validation checks |
-| `test_scientist_policy.py` | 18 | Parser, retry, formatter, baseline |
+| `test_scientist_policy.py` | 18+ | Parser, retry, formatter, baseline, bounded tools |
 | `test_lab_manager_policy.py` | 13 | Feasibility, suggestion, response |
-| **Total** | **87** | |
+| `test_reward.py` | 18 | JDG 01-03 scoring functions |
+| `test_server.py` | 5 | API endpoint integration |
+| **Total** | **134** | |

 ## Missing Coverage (not yet implemented)

 | File (planned) | Would cover |
 |---------------|-------------|
-| `test_reward.py` | JDG 01-03 scoring functions |
 | `test_env.py` | ENV 01-11 real environment |
-| `test_server.py` | API endpoint integration tests |

 ---
replicalab/scoring/__init__.py ADDED
@@ -0,0 +1,11 @@
+"""Judge scoring engine — deterministic protocol evaluation."""
+
+from .feasibility import score_feasibility
+from .fidelity import score_fidelity
+from .rigor import score_rigor
+
+__all__ = [
+    "score_feasibility",
+    "score_fidelity",
+    "score_rigor",
+]
replicalab/scoring/feasibility.py ADDED
@@ -0,0 +1,98 @@
+"""JDG 02 — Protocol feasibility score.
+
+Derives a continuous [0, 1] signal from the existing FeasibilityCheckResult
+produced by AGT 05. Does NOT rescore from scratch — this prevents drift
+between Lab Manager grounding and Judge scoring.
+
+Pure deterministic function — no LLM calls, no side effects.
+"""
+
+from __future__ import annotations
+
+from replicalab.agents.lab_manager_policy import (
+    FeasibilityCheckResult,
+    check_feasibility,
+)
+from replicalab.models import Protocol
+from replicalab.scenarios.templates import NormalizedScenarioPack
+
+
+def score_feasibility(
+    protocol: Protocol,
+    scenario: NormalizedScenarioPack,
+    check: FeasibilityCheckResult | None = None,
+) -> float:
+    """Score resource feasibility as a continuous signal.
+
+    Returns a float in [0.0, 1.0].
+
+    If *check* is not provided, calls ``check_feasibility(protocol, scenario)``
+    to compute it. Passing a pre-computed check avoids redundant work.
+
+    Each of the 7 dimensions contributes equally (1/7). Binary dimensions
+    score 0 or 1. Continuous dimensions (budget, equipment, reagents, staff)
+    give partial credit based on how close the protocol is to passing.
+    """
+    if check is None:
+        check = check_feasibility(protocol, scenario)
+
+    weight = 1.0 / 7.0
+    lab = scenario.lab_manager_observation
+
+    scores = [
+        # Protocol — binary
+        1.0 if check.protocol.ok else 0.0,
+        # Budget — partial credit
+        _budget_score(check, lab.budget_remaining),
+        # Equipment — partial credit
+        _fraction_score(
+            protocol.required_equipment,
+            set(lab.equipment_available),
+        ) if not check.equipment.ok else 1.0,
+        # Reagents — partial credit
+        _fraction_score(
+            protocol.required_reagents,
+            set(lab.reagents_in_stock),
+        ) if not check.reagents.ok else 1.0,
+        # Schedule — binary
+        1.0 if check.schedule.ok else 0.0,
+        # Staff — partial credit
+        _staff_score(check, lab.staff_count),
+        # Policy — binary (hard constraint)
+        1.0 if check.policy.ok else 0.0,
+    ]
+
+    raw = sum(s * weight for s in scores)
+    return max(0.0, min(1.0, raw))
+
+
+# ---------------------------------------------------------------------------
+# Partial credit helpers
+# ---------------------------------------------------------------------------
+
+
+def _budget_score(check: FeasibilityCheckResult, budget_remaining: float) -> float:
+    """Continuous budget score: ratio of budget to estimated cost."""
+    if check.budget.ok:
+        return 1.0
+    if check.estimated_cost <= 0:
+        return 0.0
+    return max(0.0, min(1.0, budget_remaining / check.estimated_cost))
+
+
+def _staff_score(check: FeasibilityCheckResult, staff_count: int) -> float:
+    """Continuous staff score: ratio of available staff to required."""
+    if check.staff.ok:
+        return 1.0
+    if check.required_staff <= 0:
+        return 0.0
+    return max(0.0, min(1.0, staff_count / check.required_staff))
+
+
+def _fraction_score(required: list[str], available: set[str]) -> float:
+    """Fraction of required items that are in the available set."""
+    if not required:
+        return 1.0
+    available_lower = {item.lower().strip() for item in available}
+    matched = sum(1 for item in required if item.lower().strip() in available_lower)
+    return matched / len(required)
@@ -0,0 +1,147 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """JDG 03 β€” Protocol fidelity score.
2
+
3
+ Measures how closely the final protocol matches the hidden reference spec.
4
+ The scientist never sees the hidden spec β€” this score is for the judge only.
5
+
6
+ Substitution-aware: allowed substitutions get partial credit instead of 0.
7
+
8
+ Pure deterministic function β€” no LLM calls, no side effects.
9
+ """
10
+
11
+ from __future__ import annotations
12
+
13
+ from replicalab.models import Protocol
14
+ from replicalab.scenarios.templates import NormalizedScenarioPack
15
+ from replicalab.utils.text import element_tokens, normalize_label
16
+
17
+
18
+ def score_fidelity(
19
+ protocol: Protocol,
20
+ scenario: NormalizedScenarioPack,
21
+ ) -> float:
22
+ """Score adherence to the hidden reference specification.
23
+
24
+ Returns a float in [0.0, 1.0].
25
+
26
+ Breakdown (weights):
27
+ - Required element coverage: 0.50 (substitution-aware)
28
+ - Flexible element alignment: 0.20 (bonus, no penalty for missing)
29
+ - Target metric alignment: 0.20
30
+ - Technique appropriateness: 0.10
31
+ """
32
+ spec = scenario.hidden_reference_spec
33
+ protocol_text = _protocol_text_blob(protocol)
34
+ sub_index = _build_substitution_index(scenario)
35
+
36
+ required = _required_element_score(spec.required_elements, protocol_text, sub_index)
37
+ flexible = _flexible_element_score(spec.flexible_elements, protocol_text)
38
+ target = _target_metric_score(spec.target_metric, spec.target_value, protocol_text)
39
+ technique = _technique_score(spec.summary, protocol_text)
40
+
41
+ raw = (0.50 * required) + (0.20 * flexible) + (0.20 * target) + (0.10 * technique)
42
+ return max(0.0, min(1.0, raw))
43
+
44
+
45
+ # ---------------------------------------------------------------------------
46
+ # Sub-scores
47
+ # ---------------------------------------------------------------------------
48
+
49
+
50
+ def _required_element_score(
51
+ required_elements: list[str],
52
+ protocol_text: str,
53
+ sub_index: dict[str, list[str]],
54
+ ) -> float:
55
+ """Score coverage of required elements, with substitution partial credit.
56
+
57
+ Direct match: 1.0 credit per element.
58
+ Substitution match: 0.7 credit (allowed substitution present instead).
59
+ Miss: 0.0 credit.
60
+ """
61
+ if not required_elements:
62
+ return 1.0
63
+
64
+ total = 0.0
65
+ for element in required_elements:
66
+ if _text_matches(element, protocol_text):
67
+ total += 1.0
68
+ elif _substitution_matches(element, protocol_text, sub_index):
69
+ total += 0.7
70
+ # else: 0.0
71
+
72
+ return total / len(required_elements)
73
+
74
+
75
+ def _flexible_element_score(
76
+ flexible_elements: list[str],
77
+ protocol_text: str,
78
+ ) -> float:
79
+ """Bonus for addressing flexible elements. No penalty for missing."""
80
+ if not flexible_elements:
81
+ return 1.0
82
+
83
+ matched = sum(1 for el in flexible_elements if _text_matches(el, protocol_text))
84
+ return matched / len(flexible_elements)
85
+
86
+
87
+ def _target_metric_score(
88
+ target_metric: str,
89
+ target_value: str,
90
+ protocol_text: str,
91
+ ) -> float:
92
+ """Score for mentioning the target metric and value."""
93
+ score = 0.0
94
+ if _text_matches(target_metric, protocol_text):
95
+ score += 0.5
96
+ if _text_matches(target_value, protocol_text):
97
+ score += 0.5
98
+ return score
99
+
100
+
101
+ def _technique_score(summary: str, protocol_text: str) -> float:
102
+ """Score for technique alignment with the hidden spec summary."""
103
+ return 1.0 if _text_matches(summary, protocol_text) else 0.0
104
+
105
+
106
+ # ---------------------------------------------------------------------------
107
+ # Helpers
108
+ # ---------------------------------------------------------------------------
109
+
110
+
111
+ def _protocol_text_blob(protocol: Protocol) -> str:
112
+ """Join all protocol text fields into one lowercase blob for matching."""
113
+ return " ".join([
114
+ protocol.technique,
115
+ protocol.rationale,
116
+ " ".join(protocol.controls),
117
+ " ".join(protocol.required_equipment),
118
+ " ".join(protocol.required_reagents),
119
+ ]).lower()
120
+
121
+
122
+ def _text_matches(element: str, blob: str) -> bool:
123
+ """Check if any token from *element* appears in *blob*."""
124
+ tokens = element_tokens(element)
125
+ return any(token in blob for token in tokens) if tokens else False
126
+
127
+
128
+ def _substitution_matches(
129
+ element: str,
130
+ protocol_text: str,
131
+ sub_index: dict[str, list[str]],
132
+ ) -> bool:
133
+ """Check if an allowed substitution for *element* appears in the protocol."""
134
+ norm = normalize_label(element)
135
+ alternatives = sub_index.get(norm, [])
136
+ return any(_text_matches(alt, protocol_text) for alt in alternatives)
137
+
138
+
139
+ def _build_substitution_index(
140
+ scenario: NormalizedScenarioPack,
141
+ ) -> dict[str, list[str]]:
142
+ """Map normalized originals to their alternative labels."""
143
+ index: dict[str, list[str]] = {}
144
+ for sub in scenario.allowed_substitutions:
145
+ key = normalize_label(sub.original)
146
+ index.setdefault(key, []).append(sub.alternative)
147
+ return index
replicalab/scoring/rigor.py ADDED
@@ -0,0 +1,114 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """JDG 01 β€” Protocol rigor score.
2
+
3
+ Measures structural completeness and alignment to scenario requirements.
+
+ Pure deterministic function — no LLM calls, no side effects.
+ """
+
+ from __future__ import annotations
+
+ from replicalab.models import Protocol
+ from replicalab.scenarios.templates import NormalizedScenarioPack
+ from replicalab.utils.text import element_tokens
+
+
+ def score_rigor(protocol: Protocol, scenario: NormalizedScenarioPack) -> float:
+     """Score protocol structural quality and requirement coverage.
+
+     Returns a float in [0.0, 1.0].
+
+     Breakdown (weights):
+     - Structural completeness: 0.30
+     - Success criteria coverage: 0.40
+     - Required element coverage: 0.30
+     """
+     completeness = _structural_completeness(protocol)
+     criteria = _success_criteria_coverage(protocol, scenario)
+     elements = _required_element_coverage(protocol, scenario)
+
+     raw = (0.30 * completeness) + (0.40 * criteria) + (0.30 * elements)
+     return max(0.0, min(1.0, raw))
+
+
+ # ---------------------------------------------------------------------------
+ # Sub-scores
+ # ---------------------------------------------------------------------------
+
+
+ def _structural_completeness(protocol: Protocol) -> float:
+     """Score based on whether protocol fields are meaningfully populated."""
+     score = 0.0
+     total = 0.35
+
+     # sample_size present
+     if protocol.sample_size >= 1:
+         score += 0.05
+     # sample_size statistically meaningful
+     if protocol.sample_size >= 4:
+         score += 0.05
+     # at least one control
+     if len(protocol.controls) >= 1:
+         score += 0.05
+     # second control (stronger design)
+     if len(protocol.controls) >= 2:
+         score += 0.05
+     # technique specified
+     if protocol.technique:
+         score += 0.05
+     # duration allocated
+     if protocol.duration_days >= 1:
+         score += 0.05
+     # rationale is substantive (not a placeholder)
+     if len(protocol.rationale) > 20:
+         score += 0.05
+
+     return score / total
+
+
+ def _success_criteria_coverage(
+     protocol: Protocol,
+     scenario: NormalizedScenarioPack,
+ ) -> float:
+     """Fraction of scenario success criteria addressed by protocol text."""
+     criteria = scenario.success_criteria
+     if not criteria:
+         return 1.0
+
+     protocol_text = _protocol_text_blob(protocol)
+     matched = sum(1 for c in criteria if _text_matches(c, protocol_text))
+     return matched / len(criteria)
+
+
+ def _required_element_coverage(
+     protocol: Protocol,
+     scenario: NormalizedScenarioPack,
+ ) -> float:
+     """Fraction of hidden reference required elements addressed."""
+     required = scenario.hidden_reference_spec.required_elements
+     if not required:
+         return 1.0
+
+     protocol_text = _protocol_text_blob(protocol)
+     matched = sum(1 for el in required if _text_matches(el, protocol_text))
+     return matched / len(required)
+
+
+ # ---------------------------------------------------------------------------
+ # Helpers
+ # ---------------------------------------------------------------------------
+
+
+ def _protocol_text_blob(protocol: Protocol) -> str:
+     """Join all protocol text fields into one lowercase blob for matching."""
+     return " ".join([
+         protocol.technique,
+         protocol.rationale,
+         " ".join(protocol.controls),
+         " ".join(protocol.required_equipment),
+         " ".join(protocol.required_reagents),
+     ]).lower()
+
+
+ def _text_matches(element: str, blob: str) -> bool:
+     """Check if any token from *element* appears in *blob*."""
+     tokens = element_tokens(element)
+     return any(token in blob for token in tokens) if tokens else False
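The weighted blend in `score_rigor` is easy to sanity-check in isolation. A minimal standalone sketch of just the blend-and-clamp step (the name `blend_rigor` and the input values are illustrative, not part of the module):

```python
def blend_rigor(completeness: float, criteria: float, elements: float) -> float:
    # Mirrors the 0.30 / 0.40 / 0.30 weighting used by score_rigor,
    # clamped to [0.0, 1.0].
    raw = (0.30 * completeness) + (0.40 * criteria) + (0.30 * elements)
    return max(0.0, min(1.0, raw))


# Perfect sub-scores saturate at 1.0; an empty plan bottoms out at 0.0.
print(blend_rigor(1.0, 1.0, 1.0))  # 1.0
print(blend_rigor(0.0, 0.0, 0.0))  # 0.0
# Partial coverage lands strictly between the extremes.
print(blend_rigor(1.0, 0.5, 0.0))
```

The clamp is defensive: since the weights sum to 1.0, the raw value already stays in [0, 1] whenever each sub-score does.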
replicalab/utils/text.py ADDED
@@ -0,0 +1,18 @@
+ """Shared text-matching helpers used by validation and scoring modules."""
+
+ from __future__ import annotations
+
+
+ def normalize_label(label: str) -> str:
+     """Lowercase, strip, collapse whitespace for fuzzy matching."""
+     return " ".join(label.lower().split())
+
+
+ def element_tokens(element: str) -> list[str]:
+     """Split a required-element description into searchable tokens.
+
+     Returns individual significant words (>= 3 chars) so that
+     "transaction cost assumption" matches a rationale containing
+     "transaction" or "cost".
+     """
+     return [word for word in element.lower().split() if len(word) >= 3]
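Both helpers are pure functions of their string argument, so their matching behavior is easy to demonstrate. A small sketch that restates the two helpers from the diff above and runs them on sample inputs:

```python
def normalize_label(label: str) -> str:
    # Same logic as replicalab/utils/text.py: lowercase, strip,
    # collapse internal whitespace runs to single spaces.
    return " ".join(label.lower().split())


def element_tokens(element: str) -> list[str]:
    # Words of >= 3 characters become searchable tokens; shorter
    # words ("an", "of", ...) are dropped as noise.
    return [word for word in element.lower().split() if len(word) >= 3]


print(normalize_label("  Transaction   COST  "))       # "transaction cost"
print(element_tokens("transaction cost assumption"))   # ['transaction', 'cost', 'assumption']
print(element_tokens("an of"))                         # [] (all words too short)
```

Because `_text_matches` in the scoring module only needs one token to appear in the blob, "transaction cost assumption" matches a rationale that mentions just "transaction".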
replicalab/utils/validation.py CHANGED
@@ -16,6 +16,8 @@ from pydantic import BaseModel, ConfigDict, Field
 
  from replicalab.models import Protocol
  from replicalab.scenarios.templates import NormalizedScenarioPack
+ from replicalab.utils.text import element_tokens as _element_tokens
+ from replicalab.utils.text import normalize_label as _normalize
 
 
  # ---------------------------------------------------------------------------
@@ -264,25 +266,6 @@ def _check_required_element_coverage(
  # ---------------------------------------------------------------------------
 
 
- def _normalize(label: str) -> str:
-     """Lowercase, strip, collapse whitespace for fuzzy matching."""
-     return " ".join(label.lower().split())
-
-
- def _element_tokens(element: str) -> list[str]:
-     """Split a required-element description into searchable tokens.
-
-     Returns individual significant words (>= 3 chars) so that
-     "transaction cost assumption" matches a rationale containing
-     "transaction" or "cost".
-     """
-     return [
-         word
-         for word in element.lower().split()
-         if len(word) >= 3
-     ]
-
-
  def _substitution_alternatives(scenario: NormalizedScenarioPack) -> set[str]:
      """Return the normalized 'original' values from allowed substitutions."""
      return {
tests/test_reward.py ADDED
@@ -0,0 +1,303 @@
+ """Tests for JDG 01–03 scoring functions."""
+
+ from __future__ import annotations
+
+ from replicalab.agents.lab_manager_policy import check_feasibility
+ from replicalab.models import Protocol
+ from replicalab.scenarios import generate_scenario
+ from replicalab.scoring import score_feasibility, score_fidelity, score_rigor
+
+
+ # ---------------------------------------------------------------------------
+ # Helpers
+ # ---------------------------------------------------------------------------
+
+
+ def _scenario(template: str = "ml_benchmark", difficulty: str = "easy"):
+     return generate_scenario(seed=42, template=template, difficulty=difficulty)
+
+
+ def _good_protocol(scenario) -> Protocol:
+     """Build a well-formed protocol aligned to the scenario."""
+     lab = scenario.lab_manager_observation
+     spec = scenario.hidden_reference_spec
+     return Protocol(
+         sample_size=10,
+         controls=["baseline", "ablation"],
+         technique=spec.summary[:60] if spec.summary else "replication_plan",
+         duration_days=max(1, min(2, lab.time_limit_days)),
+         required_equipment=(
+             list(lab.equipment_available[:1])
+             if lab.equipment_available
+             else []
+         ),
+         required_reagents=(
+             list(lab.reagents_in_stock[:1])
+             if lab.reagents_in_stock
+             else []
+         ),
+         rationale=(
+             f"Plan addresses: {', '.join(spec.required_elements[:2])}. "
+             f"Target metric: {spec.target_metric}. "
+             f"Target value: {spec.target_value}. "
+             "Stay within budget and schedule."
+         ),
+     )
+
+
+ def _bad_protocol() -> Protocol:
+     """Build a minimal protocol that misses most requirements."""
+     return Protocol(
+         sample_size=1,
+         controls=[],
+         technique="unknown_method",
+         duration_days=1,
+         required_equipment=[],
+         required_reagents=[],
+         rationale="No plan.",
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # JDG 01 — score_rigor
+ # ---------------------------------------------------------------------------
+
+
+ def test_rigor_good_protocol_scores_higher_than_bad() -> None:
+     scenario = _scenario("ml_benchmark", "easy")
+     good = _good_protocol(scenario)
+     bad = _bad_protocol()
+
+     good_score = score_rigor(good, scenario)
+     bad_score = score_rigor(bad, scenario)
+
+     assert good_score > bad_score
+     assert 0.0 <= good_score <= 1.0
+     assert 0.0 <= bad_score <= 1.0
+
+
+ def test_rigor_is_deterministic() -> None:
+     scenario = _scenario("ml_benchmark", "medium")
+     protocol = _good_protocol(scenario)
+
+     first = score_rigor(protocol, scenario)
+     second = score_rigor(protocol, scenario)
+
+     assert first == second
+
+
+ def test_rigor_fewer_controls_reduces_score() -> None:
+     scenario = _scenario("math_reasoning", "easy")
+     with_controls = _good_protocol(scenario)
+     fewer_controls = with_controls.model_copy(update={"controls": ["only_one"]})
+
+     score_with = score_rigor(with_controls, scenario)
+     score_fewer = score_rigor(fewer_controls, scenario)
+
+     assert score_with >= score_fewer
+
+
+ def test_rigor_short_rationale_reduces_score() -> None:
+     scenario = _scenario("finance_trading", "easy")
+     good = _good_protocol(scenario)
+     short = good.model_copy(update={"rationale": "OK."})
+
+     assert score_rigor(good, scenario) > score_rigor(short, scenario)
+
+
+ def test_rigor_all_domains_return_valid_range() -> None:
+     for template in ("ml_benchmark", "math_reasoning", "finance_trading"):
+         for difficulty in ("easy", "medium", "hard"):
+             scenario = generate_scenario(seed=99, template=template, difficulty=difficulty)
+             protocol = _good_protocol(scenario)
+             score = score_rigor(protocol, scenario)
+             assert 0.0 <= score <= 1.0, f"{template}/{difficulty}: {score}"
+
+
+ # ---------------------------------------------------------------------------
+ # JDG 02 — score_feasibility
+ # ---------------------------------------------------------------------------
+
+
+ def test_feasibility_viable_protocol_scores_high() -> None:
+     scenario = _scenario("ml_benchmark", "easy")
+     protocol = _good_protocol(scenario)
+
+     score = score_feasibility(protocol, scenario)
+
+     assert score > 0.7
+     assert 0.0 <= score <= 1.0
+
+
+ def test_feasibility_infeasible_protocol_scores_lower() -> None:
+     scenario = _scenario("ml_benchmark", "easy")
+     good = _good_protocol(scenario)
+     # Blow the budget and schedule
+     bad = good.model_copy(update={
+         "sample_size": 200,
+         "duration_days": scenario.lab_manager_observation.time_limit_days + 5,
+         "required_equipment": ["Imaginary Device"],
+     })
+
+     good_score = score_feasibility(good, scenario)
+     bad_score = score_feasibility(bad, scenario)
+
+     assert good_score > bad_score
+
+
+ def test_feasibility_accepts_precomputed_check() -> None:
+     scenario = _scenario("finance_trading", "easy")
+     protocol = _good_protocol(scenario)
+     check = check_feasibility(protocol, scenario)
+
+     score_with = score_feasibility(protocol, scenario, check=check)
+     score_without = score_feasibility(protocol, scenario)
+
+     assert score_with == score_without
+
+
+ def test_feasibility_is_deterministic() -> None:
+     scenario = _scenario("math_reasoning", "medium")
+     protocol = _good_protocol(scenario)
+
+     first = score_feasibility(protocol, scenario)
+     second = score_feasibility(protocol, scenario)
+
+     assert first == second
+
+
+ def test_feasibility_partial_credit_for_near_budget() -> None:
+     """A protocol slightly over budget should score higher than one far over."""
+     scenario = _scenario("ml_benchmark", "easy")
+     good = _good_protocol(scenario)
+
+     slightly_over = good.model_copy(update={"sample_size": 40})
+     far_over = good.model_copy(update={"sample_size": 200})
+
+     score_slight = score_feasibility(slightly_over, scenario)
+     score_far = score_feasibility(far_over, scenario)
+
+     assert score_slight >= score_far
+
+
+ def test_feasibility_all_domains_return_valid_range() -> None:
+     for template in ("ml_benchmark", "math_reasoning", "finance_trading"):
+         for difficulty in ("easy", "medium", "hard"):
+             scenario = generate_scenario(seed=99, template=template, difficulty=difficulty)
+             protocol = _good_protocol(scenario)
+             score = score_feasibility(protocol, scenario)
+             assert 0.0 <= score <= 1.0, f"{template}/{difficulty}: {score}"
+
+
+ # ---------------------------------------------------------------------------
+ # JDG 03 — score_fidelity
+ # ---------------------------------------------------------------------------
+
+
+ def test_fidelity_aligned_protocol_scores_higher() -> None:
+     scenario = _scenario("ml_benchmark", "easy")
+     aligned = _good_protocol(scenario)
+     misaligned = _bad_protocol()
+
+     aligned_score = score_fidelity(aligned, scenario)
+     misaligned_score = score_fidelity(misaligned, scenario)
+
+     assert aligned_score > misaligned_score
+     assert 0.0 <= aligned_score <= 1.0
+     assert 0.0 <= misaligned_score <= 1.0
+
+
+ def test_fidelity_is_deterministic() -> None:
+     scenario = _scenario("finance_trading", "hard")
+     protocol = _good_protocol(scenario)
+
+     first = score_fidelity(protocol, scenario)
+     second = score_fidelity(protocol, scenario)
+
+     assert first == second
+
+
+ def test_fidelity_substitution_gets_partial_credit() -> None:
+     """Using an allowed substitution should score better than a total miss."""
+     scenario = _scenario("math_reasoning", "easy")
+     spec = scenario.hidden_reference_spec
+
+     # Find a required element that has a substitution
+     sub_map = {}
+     for sub in scenario.allowed_substitutions:
+         sub_map[sub.original.lower()] = sub.alternative
+
+     if not sub_map or not spec.required_elements:
+         return  # skip if no substitution exists in this scenario
+
+     # Build a protocol that uses the substitution alternative
+     first_sub_original = list(sub_map.keys())[0]
+     first_sub_alt = sub_map[first_sub_original]
+
+     with_sub = _good_protocol(scenario).model_copy(update={
+         "rationale": f"We will use {first_sub_alt} instead. " + spec.target_metric,
+     })
+     without_anything = _bad_protocol()
+
+     score_sub = score_fidelity(with_sub, scenario)
+     score_miss = score_fidelity(without_anything, scenario)
+
+     assert score_sub > score_miss
+
+
+ def test_fidelity_mentioning_target_metric_improves_score() -> None:
+     scenario = _scenario("ml_benchmark", "easy")
+     spec = scenario.hidden_reference_spec
+
+     with_metric = _good_protocol(scenario)
+     without_metric = with_metric.model_copy(update={
+         "rationale": "Generic plan without any specific metric mentioned.",
+     })
+
+     score_with = score_fidelity(with_metric, scenario)
+     score_without = score_fidelity(without_metric, scenario)
+
+     assert score_with >= score_without
+
+
+ def test_fidelity_all_domains_return_valid_range() -> None:
+     for template in ("ml_benchmark", "math_reasoning", "finance_trading"):
+         for difficulty in ("easy", "medium", "hard"):
+             scenario = generate_scenario(seed=99, template=template, difficulty=difficulty)
+             protocol = _good_protocol(scenario)
+             score = score_fidelity(protocol, scenario)
+             assert 0.0 <= score <= 1.0, f"{template}/{difficulty}: {score}"
+
+
+ # ---------------------------------------------------------------------------
+ # Cross-scorer consistency
+ # ---------------------------------------------------------------------------
+
+
+ def test_all_scores_between_zero_and_one_for_bad_protocol() -> None:
+     for template in ("ml_benchmark", "math_reasoning", "finance_trading"):
+         scenario = generate_scenario(seed=7, template=template, difficulty="hard")
+         bad = _bad_protocol()
+
+         r = score_rigor(bad, scenario)
+         fe = score_feasibility(bad, scenario)
+         fi = score_fidelity(bad, scenario)
+
+         assert 0.0 <= r <= 1.0, f"rigor {template}: {r}"
+         assert 0.0 <= fe <= 1.0, f"feasibility {template}: {fe}"
+         assert 0.0 <= fi <= 1.0, f"fidelity {template}: {fi}"
+
+
+ def test_good_protocol_dominates_bad_on_rigor_and_fidelity() -> None:
+     """Good protocol beats bad on rigor and fidelity.
+
+     Feasibility is excluded: a protocol that asks for nothing is trivially
+     feasible (no equipment, no reagents → nothing can fail). The other two
+     scores correctly penalize an empty plan.
+     """
+     scenario = _scenario("ml_benchmark", "easy")
+     good = _good_protocol(scenario)
+     bad = _bad_protocol()
+
+     assert score_rigor(good, scenario) > score_rigor(bad, scenario)
+     assert score_fidelity(good, scenario) > score_fidelity(bad, scenario)
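The substitution-aware credit ladder that the commit message describes for JDG 03 (direct match = 1.0, allowed substitution = 0.7, miss = 0.0) can be sketched as a per-element rule. The helper name `element_credit` is hypothetical; the actual internals of `score_fidelity` are not shown in this diff, only exercised by the tests above:

```python
def element_credit(direct_match: bool, used_allowed_substitution: bool) -> float:
    # Credit ladder per required element, per the commit message:
    # direct match = 1.0, allowed substitution = 0.7, miss = 0.0.
    if direct_match:
        return 1.0
    if used_allowed_substitution:
        return 0.7
    return 0.0


credits = [
    element_credit(True, False),   # element matched directly
    element_credit(False, True),   # element covered via an allowed substitution
    element_credit(False, False),  # element missed entirely
]
print(credits)  # [1.0, 0.7, 0.0]
print(sum(credits) / len(credits))  # mean credit over the three elements
```

Averaging per-element credits like this keeps the final score in [0.0, 1.0], matching the range asserted by `test_fidelity_all_domains_return_valid_range`.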