Add deterministic judge scoring engine (JDG 01-03)
JDG 01 (rigor): structural completeness + success criteria + required
element coverage. Weighted sub-scores with token matching.
JDG 02 (feasibility): continuous signal from FeasibilityCheckResult
with partial credit for budget, equipment, reagents, and staff.
Does not rescore – wraps AGT 05 to prevent drift.
JDG 03 (fidelity): substitution-aware adherence to hidden reference
spec. Direct match = 1.0, allowed substitution = 0.7, miss = 0.0.
Shared text helpers extracted to replicalab/utils/text.py.
18 new tests covering all 3 scorers across all domains and difficulties.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- docs/map/README.md +15 -8
- docs/map/scoring.md +162 -31
- docs/map/tests.md +5 -5
- replicalab/scoring/__init__.py +11 -0
- replicalab/scoring/feasibility.py +98 -0
- replicalab/scoring/fidelity.py +147 -0
- replicalab/scoring/rigor.py +114 -0
- replicalab/utils/text.py +18 -0
- replicalab/utils/validation.py +2 -19
- tests/test_reward.py +303 -0
docs/map/README.md
CHANGED

````diff
@@ -3,7 +3,7 @@
 > Living reference of every module, class, function, and relationship.
 > Updated after each implementation session.
 >
-> **Last updated:** 2026-03-07
+> **Last updated:** 2026-03-07 (JDG 01-03 scoring implemented)
 
 ## Module Index
 
@@ -13,7 +13,7 @@
 | [scenarios.md](scenarios.md) | Scenario generation – templates, constraints, resources, hidden specs |
 | [agents.md](agents.md) | Agent policies – scientist prompt/parse/retry, lab manager feasibility/suggest/compose |
 | [validation.md](validation.md) | Protocol validation – deterministic checks against scenario constraints |
-| [scoring.md](scoring.md) | Judge scoring – rigor, feasibility, fidelity
+| [scoring.md](scoring.md) | Judge scoring – rigor, feasibility, fidelity |
 | [server.md](server.md) | FastAPI server – REST + WebSocket endpoints, stub environment |
 | [frontend.md](frontend.md) | React UI – dashboard, episode viewer, components |
 | [config.md](config.md) | Shared constants – rounds, budget, timeouts |
@@ -47,10 +47,11 @@ replicalab/utils/validation.py
 ├── replicalab.models (Protocol)
 └── replicalab.scenarios.templates (NormalizedScenarioPack)
 
 replicalab/scoring/
 ├── replicalab.models (Protocol, RewardBreakdown)
-└── replicalab.scenarios (NormalizedScenarioPack, HiddenReferenceSpec)
-
+├── replicalab.scenarios (NormalizedScenarioPack, HiddenReferenceSpec)
+├── replicalab.agents.lab_manager_policy (check_feasibility, FeasibilityCheckResult)
+└── replicalab.utils.text (element_tokens, normalize_label)
 ```
 
 ## File Tree (implemented only)
@@ -70,10 +71,14 @@ replicalab/
 │   ├── math_reasoning.py (2 cases: Cauchy-Schwarz, Jensen's inequality)
 │   ├── ml_benchmark.py (2 cases: AG News TinyBERT, CIFAR-10 ResNet-18)
 │   └── finance_trading.py (2 cases: SPY/QQQ mean-reversion, momentum futures)
 ├── scoring/
-│
+│   ├── __init__.py (exports score_rigor, score_feasibility, score_fidelity)
+│   ├── rigor.py (JDG 01: structural quality + criteria coverage)
+│   ├── feasibility.py (JDG 02: wraps FeasibilityCheckResult with partial credit)
+│   └── fidelity.py (JDG 03: substitution-aware hidden spec alignment)
 ├── utils/
 │   ├── seed.py (deterministic RNG from SHA256)
+│   ├── text.py (shared token matching: normalize_label, element_tokens)
 │   └── validation.py (MOD 05: protocol validation, 5 checks)
 
 server/
@@ -98,7 +103,9 @@ tests/
 ├── test_scenarios.py (8 tests)
 ├── test_validation.py (13 tests)
 ├── test_scientist_policy.py (18 tests)
-└── test_lab_manager_policy.py (13 tests)
+├── test_lab_manager_policy.py (13 tests)
+├── test_reward.py (18 tests – JDG 01-03 scoring)
+└── test_server.py (5 tests – API endpoints)
 ```
 
 ## Task Completion Status
@@ -108,7 +115,7 @@ tests/
 | Models (MOD) | MOD 01-05, 09, 11-12 | MOD 06 | Semantic validators for impossible plans |
 | Scenarios (SCN) | SCN 01-12 | SCN 13 | Booking/scheduling data model |
 | Agents (AGT) | AGT 01-07, 11 | AGT 08-10 | LLM-backed scientist, model selection |
-| Judge (JDG) |
+| Judge (JDG) | JDG 01-03 | JDG 04-08 | Reward composition, bonuses, penalties |
 | Environment (ENV) | – | ENV 01-11 | Entire real environment |
 | Server (API) | API 01-04, 06 (partial) | API 05, 07-10 | Replay, auth, rate limiting |
 | Frontend (FND) | FND 01-10 | – | Complete |
````
docs/map/scoring.md
CHANGED (+162 -31, new version below)

````markdown
# Scoring Map – `replicalab/scoring/`

> Judge scoring engine for protocol evaluation.
> Pure deterministic functions – no LLM calls, no side effects.
>
> **Tasks implemented:** JDG 01, JDG 02, JDG 03
> **Tasks remaining:** JDG 04-08

## Architecture

```
replicalab/scoring/
__init__.py       # exports: score_rigor, score_feasibility, score_fidelity
rigor.py          # JDG 01 – protocol structural quality
feasibility.py    # JDG 02 – resource feasibility (wraps AGT 05)
fidelity.py       # JDG 03 – adherence to hidden reference spec
```

## Shared Utilities

Token matching extracted into `replicalab/utils/text.py`:
- `normalize_label(label) -> str` – lowercase, strip, collapse whitespace
- `element_tokens(element) -> list[str]` – split into searchable tokens (3+ chars)

Used by: `validation.py`, `rigor.py`, `fidelity.py`

---

## JDG 01 – `score_rigor(protocol, scenario) -> float`

**File:** `rigor.py`
**Range:** [0.0, 1.0]
**Measures:** structural completeness and alignment to scenario requirements.

### Weight Breakdown

| Component | Weight | Method |
|-----------|--------|--------|
| Structural completeness | 0.30 | Field population checks |
| Success criteria coverage | 0.40 | Token match vs `scenario.success_criteria` |
| Required element coverage | 0.30 | Token match vs `hidden_reference_spec.required_elements` |

### Structural Completeness (0.30)

Each check contributes equally (0.05 each, total 0.35, then normalized):

| Check | Condition |
|-------|-----------|
| Sample size present | `sample_size >= 1` |
| Sample size meaningful | `sample_size >= 4` |
| Has control | `len(controls) >= 1` |
| Has second control | `len(controls) >= 2` |
| Technique specified | `technique` non-empty |
| Duration allocated | `duration_days >= 1` |
| Substantive rationale | `len(rationale) > 20` |

### Internal Functions

| Function | Purpose |
|----------|---------|
| `_structural_completeness(protocol)` | Field population score |
| `_success_criteria_coverage(protocol, scenario)` | Fraction of criteria matched |
| `_required_element_coverage(protocol, scenario)` | Fraction of elements matched |
| `_protocol_text_blob(protocol)` | Join all text fields for matching |
| `_text_matches(element, blob)` | Token overlap check |

---

## JDG 02 – `score_feasibility(protocol, scenario, check=None) -> float`

**File:** `feasibility.py`
**Range:** [0.0, 1.0]
**Measures:** whether the lab can execute this protocol.

### Key Design: No Rescoring

Does NOT recompute feasibility from scratch. Derives the score from the
`FeasibilityCheckResult` produced by AGT 05's `check_feasibility()`. If no
pre-computed check is passed, calls `check_feasibility()` internally. This
prevents drift between Lab Manager grounding and Judge scoring.

### Weight Breakdown

7 dimensions, each worth 1/7:

| Dimension | Type | Partial Credit Formula |
|-----------|------|------------------------|
| Protocol | Binary | 1.0 if ok, else 0.0 |
| Budget | Continuous | `min(1.0, budget_remaining / estimated_cost)` |
| Equipment | Continuous | fraction of required items that are available |
| Reagents | Continuous | fraction of required items that are in stock |
| Schedule | Binary | 1.0 if ok, else 0.0 |
| Staff | Continuous | `min(1.0, staff_count / required_staff)` |
| Policy | Binary | 1.0 if ok, else 0.0 (hard constraint) |

### Internal Functions

| Function | Purpose |
|----------|---------|
| `_budget_score(check, budget_remaining)` | Continuous budget ratio |
| `_staff_score(check, staff_count)` | Continuous staff ratio |
| `_fraction_score(required, available)` | Generic item-availability fraction |

---

## JDG 03 – `score_fidelity(protocol, scenario) -> float`

**File:** `fidelity.py`
**Range:** [0.0, 1.0]
**Measures:** adherence to `hidden_reference_spec` (which the scientist never sees).

### Weight Breakdown

| Component | Weight | Method |
|-----------|--------|--------|
| Required element coverage | 0.50 | Substitution-aware token match |
| Flexible element alignment | 0.20 | Bonus only, no penalty |
| Target metric alignment | 0.20 | Token match vs metric + value |
| Technique appropriateness | 0.10 | Token match vs spec summary |

### Substitution-Aware Scoring

For required elements:
- **Direct match** (token in protocol text): 1.0 credit
- **Substitution match** (allowed alternative present): 0.7 credit
- **Miss**: 0.0 credit

This is the key difference from JDG 01's element check.

### Internal Functions

| Function | Purpose |
|----------|---------|
| `_required_element_score(elements, text, sub_index)` | Substitution-aware coverage |
| `_flexible_element_score(elements, text)` | Bonus-only coverage |
| `_target_metric_score(metric, value, text)` | Metric + value matching |
| `_technique_score(summary, text)` | Summary alignment |
| `_protocol_text_blob(protocol)` | Join text fields |
| `_text_matches(element, blob)` | Token overlap |
| `_substitution_matches(element, text, sub_index)` | Check alternatives |
| `_build_substitution_index(scenario)` | Map originals → alternatives |

---

## Not Yet Implemented

### `compute_reward(protocol, scenario, ...) -> RewardBreakdown` – JDG 04/05
Combines rigor + feasibility + fidelity with weights.
Applies efficiency bonus (rounds used), communication bonus, and penalties.

### Bonuses & Penalties – JDG 06-08
- `efficiency_bonus`: reward for finishing in fewer rounds
- `communication_bonus`: reward for clear negotiation
- `penalties`: policy violations, hallucinated resources, etc.

## Data Consumed

| Source | Used by | For what |
|--------|---------|----------|
| `Protocol` (models.py) | All 3 scorers | The final agreed protocol |
| `NormalizedScenarioPack` (scenarios) | All 3 scorers | Constraints, resources, criteria |
| `HiddenReferenceSpec` (scenarios) | JDG 01, JDG 03 | Required/flexible elements, target metric |
| `FeasibilityCheckResult` (agents) | JDG 02 | 7 dimension checks with partial credit |
| `AllowedSubstitution` (scenarios) | JDG 03 | Partial credit for substitutions |
| `element_tokens` (utils/text.py) | JDG 01, JDG 03 | Shared token extraction |

## Test Coverage – `tests/test_reward.py`

| Test | What it verifies |
|------|------------------|
| `test_rigor_good_protocol_scores_higher_than_bad` | Quality ordering |
| `test_rigor_is_deterministic` | Same inputs → same output |
| `test_rigor_empty_controls_reduces_score` | Controls matter |
| `test_rigor_short_rationale_reduces_score` | Rationale length matters |
| `test_rigor_all_domains_return_valid_range` | [0,1] across all 9 combinations |
| `test_feasibility_viable_protocol_scores_high` | Good protocol > 0.7 |
| `test_feasibility_infeasible_protocol_scores_lower` | Bad < good |
| `test_feasibility_accepts_precomputed_check` | Pre-computed = computed |
| `test_feasibility_is_deterministic` | Same inputs → same output |
| `test_feasibility_partial_credit_for_near_budget` | Slightly over > far over |
| `test_feasibility_all_domains_return_valid_range` | [0,1] across all 9 combinations |
| `test_fidelity_aligned_protocol_scores_higher` | Aligned > misaligned |
| `test_fidelity_is_deterministic` | Same inputs → same output |
| `test_fidelity_substitution_gets_partial_credit` | Sub > miss |
| `test_fidelity_mentioning_target_metric_improves_score` | Metric mention helps |
| `test_fidelity_all_domains_return_valid_range` | [0,1] across all 9 combinations |
| `test_all_scores_between_zero_and_one_for_bad_protocol` | Bounds check |
| `test_good_protocol_dominates_bad_on_rigor_and_fidelity` | Cross-scorer consistency |
````
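The helpers in `replicalab/utils/text.py` are listed in the change summary (+18 lines) but their source is not reproduced in this diff. A minimal sketch consistent with the documented behavior (lowercase/strip/collapse whitespace; tokens of 3+ characters) might look like:

```python
import re


def normalize_label(label: str) -> str:
    # Lowercase, strip, and collapse internal whitespace runs to single spaces.
    return re.sub(r"\s+", " ", label.strip().lower())


def element_tokens(element: str) -> list[str]:
    # Split a normalized label on non-word characters, keeping tokens of 3+ chars.
    return [t for t in re.split(r"\W+", normalize_label(element)) if len(t) >= 3]


print(normalize_label("  Western   Blot "))  # -> "western blot"
print(element_tokens("anti-GFP antibody"))   # -> ['anti', 'gfp', 'antibody']
```

The 3-character floor keeps short stop-word-like fragments ("a", "of") from producing spurious matches in the token-overlap checks.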
docs/map/tests.md
CHANGED

````diff
@@ -1,6 +1,6 @@
 # Tests Map – `tests/`
 
->
+> 134 tests across 8 files. All passing.
 >
 > **Last verified:** 2026-03-07
 
@@ -12,17 +12,17 @@
 | `test_models.py` | 15 | All Pydantic model contracts |
 | `test_scenarios.py` | 8 | Scenario generation and determinism |
 | `test_validation.py` | 13 | Protocol validation checks |
-| `test_scientist_policy.py` | 18 | Parser, retry, formatter, baseline |
+| `test_scientist_policy.py` | 18+ | Parser, retry, formatter, baseline, bounded tools |
 | `test_lab_manager_policy.py` | 13 | Feasibility, suggestion, response |
-
+| `test_reward.py` | 18 | JDG 01-03 scoring functions |
+| `test_server.py` | 5 | API endpoint integration |
+| **Total** | **134** | |
 
 ## Missing Coverage (not yet implemented)
 
 | File (planned) | Would cover |
 |---------------|-------------|
-| `test_reward.py` | JDG 01-03 scoring functions |
 | `test_env.py` | ENV 01-11 real environment |
-| `test_server.py` | API endpoint integration tests |
 
 ---
 
````
replicalab/scoring/__init__.py
ADDED (+11)

```python
"""Judge scoring engine – deterministic protocol evaluation."""

from .feasibility import score_feasibility
from .fidelity import score_fidelity
from .rigor import score_rigor

__all__ = [
    "score_feasibility",
    "score_fidelity",
    "score_rigor",
]
```
replicalab/scoring/feasibility.py
ADDED (+98)

```python
"""JDG 02 – Protocol feasibility score.

Derives a continuous [0, 1] signal from the existing FeasibilityCheckResult
produced by AGT 05. Does NOT rescore from scratch – this prevents drift
between Lab Manager grounding and Judge scoring.

Pure deterministic function – no LLM calls, no side effects.
"""

from __future__ import annotations

from replicalab.agents.lab_manager_policy import (
    FeasibilityCheckResult,
    check_feasibility,
)
from replicalab.models import Protocol
from replicalab.scenarios.templates import NormalizedScenarioPack


def score_feasibility(
    protocol: Protocol,
    scenario: NormalizedScenarioPack,
    check: FeasibilityCheckResult | None = None,
) -> float:
    """Score resource feasibility as a continuous signal.

    Returns a float in [0.0, 1.0].

    If *check* is not provided, calls ``check_feasibility(protocol, scenario)``
    to compute it. Passing a pre-computed check avoids redundant work.

    Each of the 7 dimensions contributes equally (1/7). Binary dimensions
    score 0 or 1. Continuous dimensions (budget, equipment, reagents, staff)
    give partial credit based on how close the protocol is to passing.
    """
    if check is None:
        check = check_feasibility(protocol, scenario)

    weight = 1.0 / 7.0
    lab = scenario.lab_manager_observation

    scores = [
        # Protocol – binary
        1.0 if check.protocol.ok else 0.0,
        # Budget – partial credit
        _budget_score(check, lab.budget_remaining),
        # Equipment – partial credit
        1.0 if check.equipment.ok else _fraction_score(
            protocol.required_equipment,
            set(lab.equipment_available),
        ),
        # Reagents – partial credit
        1.0 if check.reagents.ok else _fraction_score(
            protocol.required_reagents,
            set(lab.reagents_in_stock),
        ),
        # Schedule – binary
        1.0 if check.schedule.ok else 0.0,
        # Staff – partial credit
        _staff_score(check, lab.staff_count),
        # Policy – binary (hard constraint)
        1.0 if check.policy.ok else 0.0,
    ]

    raw = sum(s * weight for s in scores)
    return max(0.0, min(1.0, raw))


# ---------------------------------------------------------------------------
# Partial credit helpers
# ---------------------------------------------------------------------------


def _budget_score(check: FeasibilityCheckResult, budget_remaining: float) -> float:
    """Continuous budget score: ratio of budget to estimated cost."""
    if check.budget.ok:
        return 1.0
    if check.estimated_cost <= 0:
        return 0.0
    return max(0.0, min(1.0, budget_remaining / check.estimated_cost))


def _staff_score(check: FeasibilityCheckResult, staff_count: int) -> float:
    """Continuous staff score: ratio of available staff to required."""
    if check.staff.ok:
        return 1.0
    if check.required_staff <= 0:
        return 0.0
    return max(0.0, min(1.0, staff_count / check.required_staff))


def _fraction_score(required: list[str], available: set[str]) -> float:
    """Fraction of required items that are in the available set."""
    if not required:
        return 1.0
    available_lower = {item.lower().strip() for item in available}
    matched = sum(1 for item in required if item.lower().strip() in available_lower)
    return matched / len(required)
```
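The budget dimension's partial credit can be illustrated standalone. This sketch mirrors `_budget_score`'s failing branch (budget check not ok, positive estimated cost) with hypothetical numbers:

```python
def budget_partial_credit(budget_remaining: float, estimated_cost: float) -> float:
    # Ratio of remaining budget to estimated cost, clamped to [0, 1].
    if estimated_cost <= 0:
        return 0.0
    return max(0.0, min(1.0, budget_remaining / estimated_cost))


# Slightly over budget earns more credit than far over budget:
print(budget_partial_credit(900.0, 1000.0))  # -> 0.9
print(budget_partial_credit(100.0, 1000.0))  # -> 0.1
```

This is the continuous signal that `test_feasibility_partial_credit_for_near_budget` exercises: a protocol that is nearly affordable scores close to 1.0 instead of flat 0.0.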
replicalab/scoring/fidelity.py
ADDED
|
@@ -0,0 +1,147 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""JDG 03 β Protocol fidelity score.
|
| 2 |
+
|
| 3 |
+
Measures how closely the final protocol matches the hidden reference spec.
|
| 4 |
+
The scientist never sees the hidden spec β this score is for the judge only.
|
| 5 |
+
|
| 6 |
+
Substitution-aware: allowed substitutions get partial credit instead of 0.
|
| 7 |
+
|
| 8 |
+
Pure deterministic function β no LLM calls, no side effects.
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
from __future__ import annotations
|
| 12 |
+
|
| 13 |
+
from replicalab.models import Protocol
|
| 14 |
+
from replicalab.scenarios.templates import NormalizedScenarioPack
|
| 15 |
+
from replicalab.utils.text import element_tokens, normalize_label
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
def score_fidelity(
|
| 19 |
+
protocol: Protocol,
|
| 20 |
+
scenario: NormalizedScenarioPack,
|
| 21 |
+
) -> float:
|
| 22 |
+
"""Score adherence to the hidden reference specification.
|
| 23 |
+
|
| 24 |
+
Returns a float in [0.0, 1.0].
|
| 25 |
+
|
| 26 |
+
Breakdown (weights):
|
| 27 |
+
- Required element coverage: 0.50 (substitution-aware)
|
| 28 |
+
- Flexible element alignment: 0.20 (bonus, no penalty for missing)
|
| 29 |
+
- Target metric alignment: 0.20
|
| 30 |
+
- Technique appropriateness: 0.10
|
| 31 |
+
"""
|
| 32 |
+
spec = scenario.hidden_reference_spec
|
| 33 |
+
protocol_text = _protocol_text_blob(protocol)
|
| 34 |
+
sub_index = _build_substitution_index(scenario)
|
| 35 |
+
|
| 36 |
+
required = _required_element_score(spec.required_elements, protocol_text, sub_index)
|
| 37 |
+
flexible = _flexible_element_score(spec.flexible_elements, protocol_text)
|
| 38 |
+
target = _target_metric_score(spec.target_metric, spec.target_value, protocol_text)
|
| 39 |
+
technique = _technique_score(spec.summary, protocol_text)
|
| 40 |
+
|
| 41 |
+
raw = (0.50 * required) + (0.20 * flexible) + (0.20 * target) + (0.10 * technique)
|
| 42 |
+
return max(0.0, min(1.0, raw))
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
# ---------------------------------------------------------------------------
|
| 46 |
+
# Sub-scores
|
| 47 |
+
# ---------------------------------------------------------------------------
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
def _required_element_score(
|
| 51 |
+
required_elements: list[str],
|
| 52 |
+
protocol_text: str,
|
| 53 |
+
sub_index: dict[str, list[str]],
|
| 54 |
+
) -> float:
|
| 55 |
+
"""Score coverage of required elements, with substitution partial credit.
|
| 56 |
+
|
| 57 |
+
    Direct match: 1.0 credit per element.
    Substitution match: 0.7 credit (allowed substitution present instead).
    Miss: 0.0 credit.
    """
    if not required_elements:
        return 1.0

    total = 0.0
    for element in required_elements:
        if _text_matches(element, protocol_text):
            total += 1.0
        elif _substitution_matches(element, protocol_text, sub_index):
            total += 0.7
        # else: 0.0

    return total / len(required_elements)


def _flexible_element_score(
    flexible_elements: list[str],
    protocol_text: str,
) -> float:
    """Bonus for addressing flexible elements. No penalty for missing."""
    if not flexible_elements:
        return 1.0

    matched = sum(1 for el in flexible_elements if _text_matches(el, protocol_text))
    return matched / len(flexible_elements)


def _target_metric_score(
    target_metric: str,
    target_value: str,
    protocol_text: str,
) -> float:
    """Score for mentioning the target metric and value."""
    score = 0.0
    if _text_matches(target_metric, protocol_text):
        score += 0.5
    if _text_matches(target_value, protocol_text):
        score += 0.5
    return score


def _technique_score(summary: str, protocol_text: str) -> float:
    """Score for technique alignment with the hidden spec summary."""
    return 1.0 if _text_matches(summary, protocol_text) else 0.0


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------


def _protocol_text_blob(protocol: Protocol) -> str:
    """Join all protocol text fields into one lowercase blob for matching."""
    return " ".join([
        protocol.technique,
        protocol.rationale,
        " ".join(protocol.controls),
        " ".join(protocol.required_equipment),
        " ".join(protocol.required_reagents),
    ]).lower()


def _text_matches(element: str, blob: str) -> bool:
    """Check if any token from *element* appears in *blob*."""
    tokens = element_tokens(element)
    return any(token in blob for token in tokens) if tokens else False


def _substitution_matches(
    element: str,
    protocol_text: str,
    sub_index: dict[str, list[str]],
) -> bool:
    """Check if an allowed substitution for *element* appears in the protocol."""
    norm = normalize_label(element)
    alternatives = sub_index.get(norm, [])
    return any(_text_matches(alt, protocol_text) for alt in alternatives)


def _build_substitution_index(
    scenario: NormalizedScenarioPack,
) -> dict[str, list[str]]:
    """Map normalized originals to their alternative labels."""
    index: dict[str, list[str]] = {}
    for sub in scenario.allowed_substitutions:
        key = normalize_label(sub.original)
        index.setdefault(key, []).append(sub.alternative)
    return index
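The credit ladder implemented above can be exercised in isolation. Below is a minimal standalone sketch; the element names, blob, and substitution table are invented for illustration, and the real scorer builds its blob and substitution index from the `Protocol` and scenario pack instead:

```python
def tokens(element: str) -> list[str]:
    # Significant words of 3+ chars, mirroring element_tokens in utils/text.py.
    return [w for w in element.lower().split() if len(w) >= 3]


def matches(element: str, blob: str) -> bool:
    toks = tokens(element)
    return any(t in blob for t in toks) if toks else False


def element_credit(required: list[str], blob: str, sub_index: dict[str, list[str]]) -> float:
    # Direct match = 1.0, allowed substitution = 0.7, miss = 0.0.
    if not required:
        return 1.0
    total = 0.0
    for element in required:
        if matches(element, blob):
            total += 1.0
        elif any(matches(alt, blob) for alt in sub_index.get(" ".join(element.lower().split()), [])):
            total += 0.7
    return total / len(required)


blob = "we benchmark on rented cloud compute with a held-out validation split"
required = ["validation split", "gpu cluster", "confusion matrix"]
subs = {"gpu cluster": ["cloud compute"]}  # hypothetical allowed substitution

print(round(element_credit(required, blob, subs), 3))  # (1.0 + 0.7 + 0.0) / 3 = 0.567
```

The direct hit, the substitution hit, and the miss average to 1.7 / 3, matching the per-element weights documented in the docstring.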
replicalab/scoring/rigor.py
ADDED
@@ -0,0 +1,114 @@
"""JDG 01 – Protocol rigor score.

Measures structural completeness and alignment to scenario requirements.
Pure deterministic function – no LLM calls, no side effects.
"""

from __future__ import annotations

from replicalab.models import Protocol
from replicalab.scenarios.templates import NormalizedScenarioPack
from replicalab.utils.text import element_tokens


def score_rigor(protocol: Protocol, scenario: NormalizedScenarioPack) -> float:
    """Score protocol structural quality and requirement coverage.

    Returns a float in [0.0, 1.0].

    Breakdown (weights):
    - Structural completeness: 0.30
    - Success criteria coverage: 0.40
    - Required element coverage: 0.30
    """
    completeness = _structural_completeness(protocol)
    criteria = _success_criteria_coverage(protocol, scenario)
    elements = _required_element_coverage(protocol, scenario)

    raw = (0.30 * completeness) + (0.40 * criteria) + (0.30 * elements)
    return max(0.0, min(1.0, raw))


# ---------------------------------------------------------------------------
# Sub-scores
# ---------------------------------------------------------------------------


def _structural_completeness(protocol: Protocol) -> float:
    """Score based on whether protocol fields are meaningfully populated."""
    score = 0.0
    total = 0.35

    # sample_size present
    if protocol.sample_size >= 1:
        score += 0.05
    # sample_size statistically meaningful
    if protocol.sample_size >= 4:
        score += 0.05
    # at least one control
    if len(protocol.controls) >= 1:
        score += 0.05
    # second control (stronger design)
    if len(protocol.controls) >= 2:
        score += 0.05
    # technique specified
    if protocol.technique:
        score += 0.05
    # duration allocated
    if protocol.duration_days >= 1:
        score += 0.05
    # rationale is substantive (not a placeholder)
    if len(protocol.rationale) > 20:
        score += 0.05

    return score / total


def _success_criteria_coverage(
    protocol: Protocol,
    scenario: NormalizedScenarioPack,
) -> float:
    """Fraction of scenario success criteria addressed by protocol text."""
    criteria = scenario.success_criteria
    if not criteria:
        return 1.0

    protocol_text = _protocol_text_blob(protocol)
    matched = sum(1 for c in criteria if _text_matches(c, protocol_text))
    return matched / len(criteria)


def _required_element_coverage(
    protocol: Protocol,
    scenario: NormalizedScenarioPack,
) -> float:
    """Fraction of hidden reference required elements addressed."""
    required = scenario.hidden_reference_spec.required_elements
    if not required:
        return 1.0

    protocol_text = _protocol_text_blob(protocol)
    matched = sum(1 for el in required if _text_matches(el, protocol_text))
    return matched / len(required)


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------


def _protocol_text_blob(protocol: Protocol) -> str:
    """Join all protocol text fields into one lowercase blob for matching."""
    return " ".join([
        protocol.technique,
        protocol.rationale,
        " ".join(protocol.controls),
        " ".join(protocol.required_equipment),
        " ".join(protocol.required_reagents),
    ]).lower()


def _text_matches(element: str, blob: str) -> bool:
    """Check if any token from *element* appears in *blob*."""
    tokens = element_tokens(element)
    return any(token in blob for token in tokens) if tokens else False
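The 0.30 / 0.40 / 0.30 blend and clamp in `score_rigor` are easy to sanity-check on their own. A standalone sketch of just that combination (the sub-score values fed in here are invented):

```python
def blend_rigor(completeness: float, criteria: float, elements: float) -> float:
    # Weighted blend with the same weights as score_rigor, clamped to [0, 1].
    raw = (0.30 * completeness) + (0.40 * criteria) + (0.30 * elements)
    return max(0.0, min(1.0, raw))


# Invented sub-score values for illustration:
print(blend_rigor(1.0, 1.0, 1.0))   # perfect sub-scores -> top of the range
print(blend_rigor(1.0, 0.5, 0.0))   # 0.30 + 0.20
```

Because the weights sum to 1.0 and each sub-score is already normalized to [0, 1], the clamp only matters as a guard against floating-point drift or out-of-range inputs.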
replicalab/utils/text.py
ADDED
@@ -0,0 +1,18 @@
"""Shared text-matching helpers used by validation and scoring modules."""

from __future__ import annotations


def normalize_label(label: str) -> str:
    """Lowercase, strip, collapse whitespace for fuzzy matching."""
    return " ".join(label.lower().split())


def element_tokens(element: str) -> list[str]:
    """Split a required-element description into searchable tokens.

    Returns individual significant words (>= 3 chars) so that
    "transaction cost assumption" matches a rationale containing
    "transaction" or "cost".
    """
    return [word for word in element.lower().split() if len(word) >= 3]
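Since both helpers are pure string functions, their behavior is easy to verify directly. Restated standalone for a quick check:

```python
def normalize_label(label: str) -> str:
    """Lowercase, strip, collapse whitespace for fuzzy matching."""
    return " ".join(label.lower().split())


def element_tokens(element: str) -> list[str]:
    """Split a description into significant words (>= 3 chars)."""
    return [word for word in element.lower().split() if len(word) >= 3]


print(normalize_label("  Transaction   Cost  "))
# -> "transaction cost"
print(element_tokens("transaction cost assumption of 5 bp"))
# -> ['transaction', 'cost', 'assumption']  ("of", "5", "bp" are dropped)
```

The 3-character cutoff is what keeps stopwords and stray numerals out of the token set that `_text_matches` searches for.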
replicalab/utils/validation.py
CHANGED
@@ -16,6 +16,8 @@ from pydantic import BaseModel, ConfigDict, Field
 
 from replicalab.models import Protocol
 from replicalab.scenarios.templates import NormalizedScenarioPack
+from replicalab.utils.text import element_tokens as _element_tokens
+from replicalab.utils.text import normalize_label as _normalize
 
 
 # ---------------------------------------------------------------------------
@@ -264,25 +266,6 @@ def _check_required_element_coverage(
 # ---------------------------------------------------------------------------
 
 
-def _normalize(label: str) -> str:
-    """Lowercase, strip, collapse whitespace for fuzzy matching."""
-    return " ".join(label.lower().split())
-
-
-def _element_tokens(element: str) -> list[str]:
-    """Split a required-element description into searchable tokens.
-
-    Returns individual significant words (>= 3 chars) so that
-    "transaction cost assumption" matches a rationale containing
-    "transaction" or "cost".
-    """
-    return [
-        word
-        for word in element.lower().split()
-        if len(word) >= 3
-    ]
-
-
 def _substitution_alternatives(scenario: NormalizedScenarioPack) -> set[str]:
     """Return the normalized 'original' values from allowed substitutions."""
     return {
tests/test_reward.py
ADDED
@@ -0,0 +1,303 @@
"""Tests for JDG 01–03 scoring functions."""

from __future__ import annotations

from replicalab.agents.lab_manager_policy import check_feasibility
from replicalab.models import Protocol
from replicalab.scenarios import generate_scenario
from replicalab.scoring import score_feasibility, score_fidelity, score_rigor


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------


def _scenario(template: str = "ml_benchmark", difficulty: str = "easy"):
    return generate_scenario(seed=42, template=template, difficulty=difficulty)


def _good_protocol(scenario) -> Protocol:
    """Build a well-formed protocol aligned to the scenario."""
    lab = scenario.lab_manager_observation
    spec = scenario.hidden_reference_spec
    return Protocol(
        sample_size=10,
        controls=["baseline", "ablation"],
        technique=spec.summary[:60] if spec.summary else "replication_plan",
        duration_days=max(1, min(2, lab.time_limit_days)),
        required_equipment=(
            list(lab.equipment_available[:1])
            if lab.equipment_available
            else []
        ),
        required_reagents=(
            list(lab.reagents_in_stock[:1])
            if lab.reagents_in_stock
            else []
        ),
        rationale=(
            f"Plan addresses: {', '.join(spec.required_elements[:2])}. "
            f"Target metric: {spec.target_metric}. "
            f"Target value: {spec.target_value}. "
            "Stay within budget and schedule."
        ),
    )


def _bad_protocol() -> Protocol:
    """Build a minimal protocol that misses most requirements."""
    return Protocol(
        sample_size=1,
        controls=[],
        technique="unknown_method",
        duration_days=1,
        required_equipment=[],
        required_reagents=[],
        rationale="No plan.",
    )


# ---------------------------------------------------------------------------
# JDG 01 – score_rigor
# ---------------------------------------------------------------------------


def test_rigor_good_protocol_scores_higher_than_bad() -> None:
    scenario = _scenario("ml_benchmark", "easy")
    good = _good_protocol(scenario)
    bad = _bad_protocol()

    good_score = score_rigor(good, scenario)
    bad_score = score_rigor(bad, scenario)

    assert good_score > bad_score
    assert 0.0 <= good_score <= 1.0
    assert 0.0 <= bad_score <= 1.0


def test_rigor_is_deterministic() -> None:
    scenario = _scenario("ml_benchmark", "medium")
    protocol = _good_protocol(scenario)

    first = score_rigor(protocol, scenario)
    second = score_rigor(protocol, scenario)

    assert first == second


def test_rigor_empty_controls_reduces_score() -> None:
    scenario = _scenario("math_reasoning", "easy")
    with_controls = _good_protocol(scenario)
    without_controls = with_controls.model_copy(update={"controls": ["only_one"]})

    score_with = score_rigor(with_controls, scenario)
    score_without = score_rigor(without_controls, scenario)

    assert score_with >= score_without


def test_rigor_short_rationale_reduces_score() -> None:
    scenario = _scenario("finance_trading", "easy")
    good = _good_protocol(scenario)
    short = good.model_copy(update={"rationale": "OK."})

    assert score_rigor(good, scenario) > score_rigor(short, scenario)


def test_rigor_all_domains_return_valid_range() -> None:
    for template in ("ml_benchmark", "math_reasoning", "finance_trading"):
        for difficulty in ("easy", "medium", "hard"):
            scenario = generate_scenario(seed=99, template=template, difficulty=difficulty)
            protocol = _good_protocol(scenario)
            score = score_rigor(protocol, scenario)
            assert 0.0 <= score <= 1.0, f"{template}/{difficulty}: {score}"


# ---------------------------------------------------------------------------
# JDG 02 – score_feasibility
# ---------------------------------------------------------------------------


def test_feasibility_viable_protocol_scores_high() -> None:
    scenario = _scenario("ml_benchmark", "easy")
    protocol = _good_protocol(scenario)

    score = score_feasibility(protocol, scenario)

    assert score > 0.7
    assert 0.0 <= score <= 1.0


def test_feasibility_infeasible_protocol_scores_lower() -> None:
    scenario = _scenario("ml_benchmark", "easy")
    good = _good_protocol(scenario)
    # Blow the budget and schedule
    bad = good.model_copy(update={
        "sample_size": 200,
        "duration_days": scenario.lab_manager_observation.time_limit_days + 5,
        "required_equipment": ["Imaginary Device"],
    })

    good_score = score_feasibility(good, scenario)
    bad_score = score_feasibility(bad, scenario)

    assert good_score > bad_score


def test_feasibility_accepts_precomputed_check() -> None:
    scenario = _scenario("finance_trading", "easy")
    protocol = _good_protocol(scenario)
    check = check_feasibility(protocol, scenario)

    score_with = score_feasibility(protocol, scenario, check=check)
    score_without = score_feasibility(protocol, scenario)

    assert score_with == score_without


def test_feasibility_is_deterministic() -> None:
    scenario = _scenario("math_reasoning", "medium")
    protocol = _good_protocol(scenario)

    first = score_feasibility(protocol, scenario)
    second = score_feasibility(protocol, scenario)

    assert first == second


def test_feasibility_partial_credit_for_near_budget() -> None:
    """A protocol slightly over budget should score higher than one far over."""
    scenario = _scenario("ml_benchmark", "easy")
    good = _good_protocol(scenario)

    slightly_over = good.model_copy(update={"sample_size": 40})
    far_over = good.model_copy(update={"sample_size": 200})

    score_slight = score_feasibility(slightly_over, scenario)
    score_far = score_feasibility(far_over, scenario)

    assert score_slight >= score_far


def test_feasibility_all_domains_return_valid_range() -> None:
    for template in ("ml_benchmark", "math_reasoning", "finance_trading"):
        for difficulty in ("easy", "medium", "hard"):
            scenario = generate_scenario(seed=99, template=template, difficulty=difficulty)
            protocol = _good_protocol(scenario)
            score = score_feasibility(protocol, scenario)
            assert 0.0 <= score <= 1.0, f"{template}/{difficulty}: {score}"


# ---------------------------------------------------------------------------
# JDG 03 – score_fidelity
# ---------------------------------------------------------------------------


def test_fidelity_aligned_protocol_scores_higher() -> None:
    scenario = _scenario("ml_benchmark", "easy")
    aligned = _good_protocol(scenario)
    misaligned = _bad_protocol()

    aligned_score = score_fidelity(aligned, scenario)
    misaligned_score = score_fidelity(misaligned, scenario)

    assert aligned_score > misaligned_score
    assert 0.0 <= aligned_score <= 1.0
    assert 0.0 <= misaligned_score <= 1.0


def test_fidelity_is_deterministic() -> None:
    scenario = _scenario("finance_trading", "hard")
    protocol = _good_protocol(scenario)

    first = score_fidelity(protocol, scenario)
    second = score_fidelity(protocol, scenario)

    assert first == second


def test_fidelity_substitution_gets_partial_credit() -> None:
    """Using an allowed substitution should score better than a total miss."""
    scenario = _scenario("math_reasoning", "easy")
    spec = scenario.hidden_reference_spec

    # Find a required element that has a substitution
    sub_map = {}
    for sub in scenario.allowed_substitutions:
        sub_map[sub.original.lower()] = sub.alternative

    if not sub_map or not spec.required_elements:
        return  # skip if no substitution exists in this scenario

    # Build protocol that uses the substitution alternative
    first_sub_original = list(sub_map.keys())[0]
    first_sub_alt = sub_map[first_sub_original]

    with_sub = _good_protocol(scenario).model_copy(update={
        "rationale": f"We will use {first_sub_alt} instead. " + spec.target_metric,
    })
    without_anything = _bad_protocol()

    score_sub = score_fidelity(with_sub, scenario)
    score_miss = score_fidelity(without_anything, scenario)

    assert score_sub > score_miss


def test_fidelity_mentioning_target_metric_improves_score() -> None:
    scenario = _scenario("ml_benchmark", "easy")
    spec = scenario.hidden_reference_spec

    with_metric = _good_protocol(scenario)
    without_metric = with_metric.model_copy(update={
        "rationale": "Generic plan without any specific metric mentioned.",
    })

    score_with = score_fidelity(with_metric, scenario)
    score_without = score_fidelity(without_metric, scenario)

    assert score_with >= score_without


def test_fidelity_all_domains_return_valid_range() -> None:
    for template in ("ml_benchmark", "math_reasoning", "finance_trading"):
        for difficulty in ("easy", "medium", "hard"):
            scenario = generate_scenario(seed=99, template=template, difficulty=difficulty)
            protocol = _good_protocol(scenario)
            score = score_fidelity(protocol, scenario)
            assert 0.0 <= score <= 1.0, f"{template}/{difficulty}: {score}"


# ---------------------------------------------------------------------------
# Cross-scorer consistency
# ---------------------------------------------------------------------------


def test_all_scores_between_zero_and_one_for_bad_protocol() -> None:
    for template in ("ml_benchmark", "math_reasoning", "finance_trading"):
        scenario = generate_scenario(seed=7, template=template, difficulty="hard")
        bad = _bad_protocol()

        r = score_rigor(bad, scenario)
        fe = score_feasibility(bad, scenario)
        fi = score_fidelity(bad, scenario)

        assert 0.0 <= r <= 1.0, f"rigor {template}: {r}"
        assert 0.0 <= fe <= 1.0, f"feasibility {template}: {fe}"
        assert 0.0 <= fi <= 1.0, f"fidelity {template}: {fi}"


def test_good_protocol_dominates_bad_on_rigor_and_fidelity() -> None:
    """Good protocol beats bad on rigor and fidelity.

    Feasibility is excluded: a protocol that asks for nothing is trivially
    feasible (no equipment, no reagents – nothing can fail). The other two
    scores correctly penalize an empty plan.
    """
    scenario = _scenario("ml_benchmark", "easy")
    good = _good_protocol(scenario)
    bad = _bad_protocol()

    assert score_rigor(good, scenario) > score_rigor(bad, scenario)
    assert score_fidelity(good, scenario) > score_fidelity(bad, scenario)