# Scoring Map – `replicalab/scoring/`

> Judge scoring engine for protocol evaluation.
> Pure deterministic functions – no model calls, no side effects.
>
> **Tasks implemented:** JDG 01, JDG 02, JDG 03, JDG 04, JDG 05, JDG 06, JDG 08
> **Tasks remaining:** JDG 07
## Oracle Hybrid Note

The repo now includes an additive Oracle layer for richer scenario generation,
optional Lab Manager narration, optional event injection, and post-mortem
analysis. None of that replaces the files in `replicalab/scoring/`.

For RL training, this folder remains the canonical reward source:

- deterministic
- reproducible
- testable
- used by the environment for the actual scalar reward signal
## Architecture

```
replicalab/scoring/
  __init__.py      # exports: score_rigor, score_feasibility, score_fidelity,
                   #          build_reward_breakdown, compute_total_reward
  rigor.py         # JDG 01 – protocol structural quality
  feasibility.py   # JDG 02 – resource feasibility (wraps AGT 05)
  fidelity.py      # JDG 03 – adherence to hidden reference spec
  rubric.py        # JDG 04-05 – total reward formula and breakdown builder
  explain.py       # JDG 06 – deterministic plain-English explanation
```
## Current Reward Structure

The training signal now has two layers:

- **Terminal reward** from `replicalab/scoring/rubric.py`
  - `10 * rigor * feasibility * fidelity * parsimony`
  - plus bonuses
  - minus named penalties
- **Step shaping reward** from `replicalab/env/replicalab_env.py`
  - information-gain bonus for novel questions
  - protocol-delta and momentum bonuses for productive revisions
  - contradiction, hallucination, stalling, regression, invalid-action,
    timeout, and no-agreement penalties

The judge remains deterministic. The terminal audit still explains the final
`RewardBreakdown`, while the cumulative episode reward now includes the per-step
shaping applied inside the environment.
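The two layers above compose additively over an episode. A minimal sketch, assuming the shaping terms accrue per step and the judge's scalar arrives at termination (`episode_return` is an illustrative name, not part of the replicalab API):

```python
def episode_return(step_shaping_rewards: list[float], terminal_reward: float) -> float:
    """Cumulative episode reward: per-step shaping plus the terminal judge scalar."""
    return sum(step_shaping_rewards) + terminal_reward

# e.g. three shaped steps followed by the judge's terminal reward
total = episode_return([0.1, -0.05, 0.2], 7.5)
```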
## Shared Utilities

Token-matching helpers are extracted into `replicalab/utils/text.py`:

- `normalize_label(label) -> str` – lowercase, strip, collapse whitespace
- `element_tokens(element) -> list[str]` – split into searchable tokens (3+ chars)

Used by: `validation.py`, `rigor.py`, `fidelity.py`
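A minimal sketch of what these two helpers might look like, assuming only the behavior described above (the actual implementations live in `replicalab/utils/text.py`):

```python
import re

def normalize_label(label: str) -> str:
    """Lowercase, strip, and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", label.strip().lower())

def element_tokens(element: str) -> list[str]:
    """Split a normalized label into searchable tokens of 3+ characters."""
    return [tok for tok in normalize_label(element).split() if len(tok) >= 3]
```

The 3-character floor drops articles and other short noise words before token matching.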
---

## JDG 01 – `score_rigor(protocol, scenario) -> float`

**File:** `rigor.py`
**Range:** [0.0, 1.0]
**Measures:** structural completeness and alignment to scenario requirements.

### Weight Breakdown

| Component | Weight | Method |
|-----------|--------|--------|
| Structural completeness | 0.30 | Field population checks |
| Success criteria coverage | 0.40 | Token match vs `scenario.success_criteria` |
| Required element coverage | 0.30 | Token match vs `hidden_reference_spec.required_elements` |
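The table above implies a simple weighted sum. A sketch, assuming each component scorer already returns a value in [0, 1] (`combine_rigor` is an illustrative name):

```python
def combine_rigor(structural: float, criteria: float, elements: float) -> float:
    """Weighted sum per the table: 0.30 structural + 0.40 criteria + 0.30 elements."""
    return 0.30 * structural + 0.40 * criteria + 0.30 * elements
```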
### Structural Completeness (0.30)

Each check contributes equally (0.05 each, raw total 0.35, then normalized):

| Check | Condition |
|-------|-----------|
| Sample size present | `sample_size >= 1` |
| Sample size meaningful | `sample_size >= 4` |
| Has control | `len(controls) >= 1` |
| Has second control | `len(controls) >= 2` |
| Technique specified | `technique` non-empty |
| Duration allocated | `duration_days >= 1` |
| Substantive rationale | `len(rationale) > 20` |
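An illustrative sketch of these seven checks: each passing check adds 0.05 and the raw total (max 0.35) is normalized back to [0, 1]. The parameter names follow the conditions in the table; the real code lives in `rigor.py` and takes a `Protocol` object.

```python
def structural_completeness(sample_size: int, controls: list[str],
                            technique: str, duration_days: int,
                            rationale: str) -> float:
    checks = [
        sample_size >= 1,         # sample size present
        sample_size >= 4,         # sample size meaningful
        len(controls) >= 1,       # has control
        len(controls) >= 2,       # has second control
        bool(technique.strip()),  # technique specified
        duration_days >= 1,       # duration allocated
        len(rationale) > 20,      # substantive rationale
    ]
    return sum(0.05 for passed in checks if passed) / 0.35
```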
### Internal Functions

| Function | Purpose |
|----------|---------|
| `_structural_completeness(protocol)` | Field population score |
| `_success_criteria_coverage(protocol, scenario)` | Fraction of criteria matched |
| `_required_element_coverage(protocol, scenario)` | Fraction of elements matched |
| `_protocol_text_blob(protocol)` | Join all text fields for matching |
| `_text_matches(element, blob)` | Token overlap check |

---
## JDG 02 – `score_feasibility(protocol, scenario, check=None) -> float`

**File:** `feasibility.py`
**Range:** [0.0, 1.0]
**Measures:** whether the lab can execute this protocol.

### Key Design: No Rescoring

Does NOT recompute feasibility from scratch. Derives the score from the
`FeasibilityCheckResult` produced by AGT 05's `check_feasibility()`. If no
pre-computed check is passed, calls `check_feasibility()` internally. This
prevents drift between Lab Manager grounding and Judge scoring.

### Weight Breakdown

7 dimensions, each worth 1/7:

| Dimension | Type | Partial Credit Formula |
|-----------|------|------------------------|
| Protocol | Binary | 1.0 if ok, else 0.0 |
| Budget | Continuous | `min(1.0, budget_remaining / estimated_cost)` |
| Equipment | Continuous | fraction of required items that are available |
| Reagents | Continuous | fraction of required items that are in stock |
| Schedule | Binary | 1.0 if ok, else 0.0 |
| Staff | Continuous | `min(1.0, staff_count / required_staff)` |
| Policy | Binary | 1.0 if ok, else 0.0 (hard constraint) |
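A sketch of the equal-weight aggregation and one of the continuous partial-credit formulas from the table. The function names are illustrative; the real derivation reads a `FeasibilityCheckResult` in `feasibility.py`.

```python
def feasibility_score(dimension_scores: list[float]) -> float:
    """Seven dimension scores in [0, 1], each contributing 1/7."""
    assert len(dimension_scores) == 7
    return sum(dimension_scores) / 7.0

def budget_score(budget_remaining: float, estimated_cost: float) -> float:
    """Continuous credit: full marks once the remaining budget covers the cost."""
    return min(1.0, budget_remaining / estimated_cost)
```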
### Internal Functions

| Function | Purpose |
|----------|---------|
| `_budget_score(check, budget_remaining)` | Continuous budget ratio |
| `_staff_score(check, staff_count)` | Continuous staff ratio |
| `_fraction_score(required, available)` | Generic item-availability fraction |

---
## JDG 03 – `score_fidelity(protocol, scenario) -> float`

**File:** `fidelity.py`
**Range:** [0.0, 1.0]
**Measures:** adherence to `hidden_reference_spec` (which the scientist never sees).

### Weight Breakdown

| Component | Weight | Method |
|-----------|--------|--------|
| Required element coverage | 0.50 | Substitution-aware token match |
| Flexible element alignment | 0.20 | Bonus only, no penalty |
| Target metric alignment | 0.20 | Token match vs metric + value |
| Technique appropriateness | 0.10 | Token match vs spec summary |

### Substitution-Aware Scoring

For required elements:

- **Direct match** (token in protocol text): 1.0 credit
- **Substitution match** (allowed alternative present): 0.7 credit
- **Miss**: 0.0 credit

This is the key difference from JDG 01's element check.
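The credit rule above can be sketched as follows. Matching here is simplified to substring containment; the real code uses the shared token matching from `utils/text.py` and a substitution index built from the scenario. All names are illustrative.

```python
def element_credit(element: str, protocol_text: str,
                   substitutions: dict[str, list[str]]) -> float:
    """Direct match: 1.0; allowed substitution: 0.7; miss: 0.0."""
    text = protocol_text.lower()
    if element.lower() in text:
        return 1.0                      # direct match
    for alt in substitutions.get(element, []):
        if alt.lower() in text:
            return 0.7                  # allowed alternative present
    return 0.0                          # miss
```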
### Internal Functions

| Function | Purpose |
|----------|---------|
| `_required_element_score(elements, text, sub_index)` | Substitution-aware coverage |
| `_flexible_element_score(elements, text)` | Bonus-only coverage |
| `_target_metric_score(metric, value, text)` | Metric + value matching |
| `_technique_score(summary, text)` | Summary alignment |
| `_protocol_text_blob(protocol)` | Join text fields |
| `_text_matches(element, blob)` | Token overlap |
| `_substitution_matches(element, text, sub_index)` | Check alternatives |
| `_build_substitution_index(scenario)` | Map originals → alternatives |

---
## JDG 04 – `compute_total_reward(breakdown) -> float`

**File:** `rubric.py`
**Formula:** `10 × rigor × feasibility × fidelity + efficiency_bonus + communication_bonus − sum(penalties)`

Returns a scalar reward from a `RewardBreakdown` object.
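A sketch of that formula, using a stand-in `Breakdown` dataclass rather than the real `RewardBreakdown` model (field names follow the formula above; the actual model may differ):

```python
from dataclasses import dataclass, field

@dataclass
class Breakdown:  # illustrative stand-in for RewardBreakdown
    rigor: float
    feasibility: float
    fidelity: float
    efficiency_bonus: float = 0.0
    communication_bonus: float = 0.0
    penalties: dict[str, float] = field(default_factory=dict)

def compute_total_reward(b: Breakdown) -> float:
    """10 × rigor × feasibility × fidelity, plus bonuses, minus named penalties."""
    base = 10.0 * b.rigor * b.feasibility * b.fidelity
    return base + b.efficiency_bonus + b.communication_bonus - sum(b.penalties.values())
```

Because the three judge scores multiply, a zero on any axis zeroes the base term; bonuses and penalties then adjust additively.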
## JDG 05 – `build_reward_breakdown(protocol, scenario, rounds_used, max_rounds, *, check=None) -> RewardBreakdown`

**File:** `rubric.py`
**Composes:** rigor (JDG 01) + feasibility (JDG 02) + fidelity (JDG 03) + efficiency bonus.

### Efficiency Bonus

- Max bonus: 1.0 (configurable via `_MAX_EFFICIENCY_BONUS`)
- Formula: `max_bonus × (max_rounds - rounds_used) / (max_rounds - 1)`
- Finishing in round 1 of 6 → maximum bonus; using all rounds → 0
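A sketch of `_efficiency_bonus` under the formula above. The clamping and the `max_rounds <= 1` guard are added safeguards for illustration, not confirmed behavior:

```python
_MAX_EFFICIENCY_BONUS = 1.0  # max bonus, per the bullet above

def efficiency_bonus(rounds_used: int, max_rounds: int) -> float:
    """Linear bonus: full credit for round 1, zero for using every round."""
    if max_rounds <= 1:
        return 0.0
    frac = (max_rounds - rounds_used) / (max_rounds - 1)
    return _MAX_EFFICIENCY_BONUS * max(0.0, min(1.0, frac))
```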
### Internal Functions

| Function | Purpose |
|----------|---------|
| `compute_total_reward(breakdown)` | Apply the reward formula |
| `build_reward_breakdown(...)` | Compose all sub-scores into a breakdown |
| `_efficiency_bonus(rounds_used, max_rounds)` | Compute efficiency bonus |

---
## Not Yet Implemented

### Bonuses & Penalties – JDG 07

- `communication_bonus`: reward for clear negotiation (reserved)
- `penalties`: policy violations, hallucinated resources, etc.

## Data Consumed

| Source | Used by | For what |
|--------|---------|----------|
| `Protocol` (models.py) | All 3 scorers | The final agreed protocol |
| `NormalizedScenarioPack` (scenarios) | All 3 scorers | Constraints, resources, criteria |
| `HiddenReferenceSpec` (scenarios) | JDG 01, JDG 03 | Required/flexible elements, target metric |
| `FeasibilityCheckResult` (agents) | JDG 02 | 7 dimension checks with partial credit |
| `AllowedSubstitution` (scenarios) | JDG 03 | Partial credit for substitutions |
| `element_tokens` (utils/text.py) | JDG 01, JDG 03 | Shared token extraction |
## Test Coverage – `tests/test_reward.py`

| Test | What it verifies |
|------|------------------|
| `test_rigor_good_protocol_scores_higher_than_bad` | Quality ordering |
| `test_rigor_is_deterministic` | Same inputs → same output |
| `test_rigor_empty_controls_reduces_score` | Controls matter |
| `test_rigor_short_rationale_reduces_score` | Rationale length matters |
| `test_rigor_all_domains_return_valid_range` | [0,1] across all 9 combinations |
| `test_feasibility_viable_protocol_scores_high` | Good protocol > 0.7 |
| `test_feasibility_infeasible_protocol_scores_lower` | Bad < good |
| `test_feasibility_accepts_precomputed_check` | Pre-computed = computed |
| `test_feasibility_is_deterministic` | Same inputs → same output |
| `test_feasibility_partial_credit_for_near_budget` | Slightly over > far over |
| `test_feasibility_all_domains_return_valid_range` | [0,1] across all 9 combinations |
| `test_fidelity_aligned_protocol_scores_higher` | Aligned > misaligned |
| `test_fidelity_is_deterministic` | Same inputs → same output |
| `test_fidelity_substitution_gets_partial_credit` | Sub > miss |
| `test_fidelity_mentioning_target_metric_improves_score` | Metric mention helps |
| `test_fidelity_all_domains_return_valid_range` | [0,1] across all 9 combinations |
| `test_all_scores_between_zero_and_one_for_bad_protocol` | Bounds check |
| `test_good_protocol_dominates_bad_on_rigor_and_fidelity` | Cross-scorer consistency |
| `test_good_protocol_beats_awful_protocol_on_all_scores_and_total_reward` | Good protocol beats a clearly infeasible protocol across all judge axes |
| `test_rigor_explicit_success_criteria_mentions_improve_score` | Success-criteria mentions improve rigor coverage |
| `test_feasibility_partial_equipment_credit_sits_between_full_and_total_miss` | Partial equipment availability yields intermediate feasibility credit |
| `test_fidelity_direct_match_beats_substitution_and_miss` | Fidelity prefers direct match over allowed substitution over a miss |
| `test_breakdown_matches_with_and_without_precomputed_feasibility_check` | Reward breakdown stays identical with or without an injected feasibility check |