
Scoring Map — replicalab/scoring/

Judge scoring engine for protocol evaluation. Pure deterministic functions — no model calls, no side effects.

Tasks implemented: JDG 01, JDG 02, JDG 03, JDG 04, JDG 05, JDG 06, JDG 08
Tasks remaining: JDG 07

Oracle Hybrid Note

The repo now includes an additive Oracle layer for richer scenario generation, optional Lab Manager narration, optional event injection, and post-mortem analysis. None of that replaces the files in replicalab/scoring/.

For RL training, this folder remains the canonical reward source:

  • deterministic
  • reproducible
  • testable
  • used by the environment for the actual scalar reward signal

Architecture

replicalab/scoring/
    __init__.py          # exports: score_rigor, score_feasibility, score_fidelity,
                         #          build_reward_breakdown, compute_total_reward
    rigor.py             # JDG 01 — protocol structural quality
    feasibility.py       # JDG 02 — resource feasibility (wraps AGT 05)
    fidelity.py          # JDG 03 — adherence to hidden reference spec
    rubric.py            # JDG 04-05 — total reward formula and breakdown builder
    explain.py           # JDG 06 — deterministic plain-English explanation

Current Reward Structure

The training signal now has two layers:

  • Terminal reward from replicalab/scoring/rubric.py
    • 10 * rigor * feasibility * fidelity * parsimony
    • plus bonuses
    • minus named penalties
  • Step shaping reward from replicalab/env/replicalab_env.py
    • information-gain bonus for novel questions
    • protocol-delta and momentum bonuses for productive revisions
    • contradiction, hallucination, stalling, regression, invalid-action, timeout, and no-agreement penalties

The judge remains deterministic. The terminal audit still explains the final RewardBreakdown, while the cumulative episode reward now includes the per-step shaping applied inside the environment.
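Put together, the return the trainer sees for an episode is just the accumulated step-shaping terms plus the terminal judge reward. A minimal sketch (the function name and list-of-floats shape are illustrative, not the env's actual API):

```python
def episode_return(step_shaping: list[float], terminal_reward: float) -> float:
    """Cumulative episode reward: per-step shaping terms plus the terminal judge reward."""
    return sum(step_shaping) + terminal_reward
```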

Shared Utilities

Token matching extracted into replicalab/utils/text.py:

  • normalize_label(label) -> str β€” lowercase, strip, collapse whitespace
  • element_tokens(element) -> list[str] β€” split into searchable tokens (3+ chars)

Used by: validation.py, rigor.py, fidelity.py
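A plausible implementation of these two helpers (signatures come from the list above; the bodies are a sketch, not the actual replicalab/utils/text.py source):

```python
import re


def normalize_label(label: str) -> str:
    """Lowercase, strip, and collapse runs of whitespace to a single space."""
    return re.sub(r"\s+", " ", label.strip().lower())


def element_tokens(element: str) -> list[str]:
    """Split a normalized element into searchable tokens of 3+ characters."""
    return [tok for tok in re.split(r"\W+", normalize_label(element)) if len(tok) >= 3]
```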


JDG 01 — score_rigor(protocol, scenario) -> float

File: rigor.py
Range: [0.0, 1.0]
Measures: structural completeness and alignment to scenario requirements.

Weight Breakdown

| Component | Weight | Method |
| --- | --- | --- |
| Structural completeness | 0.30 | Field population checks |
| Success criteria coverage | 0.40 | Token match vs scenario.success_criteria |
| Required element coverage | 0.30 | Token match vs hidden_reference_spec.required_elements |

Structural Completeness (0.30)

Each check contributes equally (0.05 each, total 0.35, then normalized):

| Check | Condition |
| --- | --- |
| Sample size present | sample_size >= 1 |
| Sample size meaningful | sample_size >= 4 |
| Has control | len(controls) >= 1 |
| Has second control | len(controls) >= 2 |
| Technique specified | technique non-empty |
| Duration allocated | duration_days >= 1 |
| Substantive rationale | len(rationale) > 20 |

Internal Functions

| Function | Purpose |
| --- | --- |
| _structural_completeness(protocol) | Field population score |
| _success_criteria_coverage(protocol, scenario) | Fraction of criteria matched |
| _required_element_coverage(protocol, scenario) | Fraction of elements matched |
| _protocol_text_blob(protocol) | Join all text fields for matching |
| _text_matches(element, blob) | Token overlap check |

JDG 02 — score_feasibility(protocol, scenario, check=None) -> float

File: feasibility.py
Range: [0.0, 1.0]
Measures: whether the lab can execute this protocol.

Key Design: No Rescoring

Does NOT recompute feasibility from scratch. Derives score from FeasibilityCheckResult produced by AGT 05's check_feasibility(). If no pre-computed check is passed, calls check_feasibility() internally. This prevents drift between Lab Manager grounding and Judge scoring.

Weight Breakdown

7 dimensions, each worth 1/7:

| Dimension | Type | Partial Credit Formula |
| --- | --- | --- |
| Protocol | Binary | 1.0 if ok, else 0.0 |
| Budget | Continuous | min(1.0, budget_remaining / estimated_cost) |
| Equipment | Continuous | fraction of required items that are available |
| Reagents | Continuous | fraction of required items that are in stock |
| Schedule | Binary | 1.0 if ok, else 0.0 |
| Staff | Continuous | min(1.0, staff_count / required_staff) |
| Policy | Binary | 1.0 if ok, else 0.0 (hard constraint) |

Internal Functions

| Function | Purpose |
| --- | --- |
| _budget_score(check, budget_remaining) | Continuous budget ratio |
| _staff_score(check, staff_count) | Continuous staff ratio |
| _fraction_score(required, available) | Generic item-availability fraction |

JDG 03 — score_fidelity(protocol, scenario) -> float

File: fidelity.py
Range: [0.0, 1.0]
Measures: adherence to hidden_reference_spec (which the scientist never sees).

Weight Breakdown

| Component | Weight | Method |
| --- | --- | --- |
| Required element coverage | 0.50 | Substitution-aware token match |
| Flexible element alignment | 0.20 | Bonus only, no penalty |
| Target metric alignment | 0.20 | Token match vs metric + value |
| Technique appropriateness | 0.10 | Token match vs spec summary |

Substitution-Aware Scoring

For required elements:

  • Direct match (token in protocol text): 1.0 credit
  • Substitution match (allowed alternative present): 0.7 credit
  • Miss: 0.0 credit

This is the key difference from JDG 01's element check.

Internal Functions

| Function | Purpose |
| --- | --- |
| _required_element_score(elements, text, sub_index) | Substitution-aware coverage |
| _flexible_element_score(elements, text) | Bonus-only coverage |
| _target_metric_score(metric, value, text) | Metric + value matching |
| _technique_score(summary, text) | Summary alignment |
| _protocol_text_blob(protocol) | Join text fields |
| _text_matches(element, blob) | Token overlap |
| _substitution_matches(element, text, sub_index) | Check alternatives |
| _build_substitution_index(scenario) | Map originals → alternatives |


JDG 04 — compute_total_reward(breakdown) -> float

File: rubric.py
Formula: 10 × rigor × feasibility × fidelity + efficiency_bonus + communication_bonus − sum(penalties)

Returns a scalar reward from a RewardBreakdown object.
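The formula translates directly to code; here the RewardBreakdown is stood in by a plain dict (the real object is a model class with named fields):

```python
def compute_total_reward(b: dict) -> float:
    """10 x rigor x feasibility x fidelity, plus bonuses, minus named penalties."""
    return (
        10.0 * b["rigor"] * b["feasibility"] * b["fidelity"]
        + b.get("efficiency_bonus", 0.0)
        + b.get("communication_bonus", 0.0)
        - sum(b.get("penalties", {}).values())
    )
```

The multiplicative core means a zero on any one axis zeroes the product: a perfectly rigorous but infeasible protocol earns no base reward.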

JDG 05 — build_reward_breakdown(protocol, scenario, rounds_used, max_rounds, *, check=None) -> RewardBreakdown

File: rubric.py
Composes: rigor (JDG 01) + feasibility (JDG 02) + fidelity (JDG 03) + efficiency bonus.

Efficiency Bonus

  • Max bonus: 1.0 (configurable via _MAX_EFFICIENCY_BONUS)
  • Formula: bonus Γ— (max_rounds - rounds_used) / (max_rounds - 1)
  • Finishing in round 1 of 6 β†’ maximum bonus; using all rounds β†’ 0

Internal Functions

| Function | Purpose |
| --- | --- |
| compute_total_reward(breakdown) | Apply the reward formula |
| build_reward_breakdown(...) | Compose all sub-scores into a breakdown |
| _efficiency_bonus(rounds_used, max_rounds) | Compute efficiency bonus |

Not Yet Implemented

Bonuses & Penalties — JDG 07

  • communication_bonus: reward for clear negotiation (reserved)
  • penalties: policy violations, hallucinated resources, etc.

Data Consumed

| Source | Used by | For what |
| --- | --- | --- |
| Protocol (models.py) | All 3 scorers | The final agreed protocol |
| NormalizedScenarioPack (scenarios) | All 3 scorers | Constraints, resources, criteria |
| HiddenReferenceSpec (scenarios) | JDG 01, JDG 03 | Required/flexible elements, target metric |
| FeasibilityCheckResult (agents) | JDG 02 | 7 dimension checks with partial credit |
| AllowedSubstitution (scenarios) | JDG 03 | Partial credit for substitutions |
| element_tokens (utils/text.py) | JDG 01, JDG 03 | Shared token extraction |

Test Coverage — tests/test_reward.py

| Test | What it verifies |
| --- | --- |
| test_rigor_good_protocol_scores_higher_than_bad | Quality ordering |
| test_rigor_is_deterministic | Same inputs → same output |
| test_rigor_empty_controls_reduces_score | Controls matter |
| test_rigor_short_rationale_reduces_score | Rationale length matters |
| test_rigor_all_domains_return_valid_range | [0,1] across all 9 combinations |
| test_feasibility_viable_protocol_scores_high | Good protocol > 0.7 |
| test_feasibility_infeasible_protocol_scores_lower | Bad < good |
| test_feasibility_accepts_precomputed_check | Pre-computed = computed |
| test_feasibility_is_deterministic | Same inputs → same output |
| test_feasibility_partial_credit_for_near_budget | Slightly over > far over |
| test_feasibility_all_domains_return_valid_range | [0,1] across all 9 combinations |
| test_fidelity_aligned_protocol_scores_higher | Aligned > misaligned |
| test_fidelity_is_deterministic | Same inputs → same output |
| test_fidelity_substitution_gets_partial_credit | Sub > miss |
| test_fidelity_mentioning_target_metric_improves_score | Metric mention helps |
| test_fidelity_all_domains_return_valid_range | [0,1] across all 9 combinations |
| test_all_scores_between_zero_and_one_for_bad_protocol | Bounds check |
| test_good_protocol_dominates_bad_on_rigor_and_fidelity | Cross-scorer consistency |
| test_good_protocol_beats_awful_protocol_on_all_scores_and_total_reward | Good protocol beats a clearly infeasible protocol across all judge axes |
| test_rigor_explicit_success_criteria_mentions_improve_score | Success-criteria mentions improve rigor coverage |
| test_feasibility_partial_equipment_credit_sits_between_full_and_total_miss | Partial equipment availability yields intermediate feasibility credit |
| test_fidelity_direct_match_beats_substitution_and_miss | Fidelity prefers direct match over allowed substitution over a miss |
| test_breakdown_matches_with_and_without_precomputed_feasibility_check | Reward breakdown stays identical with or without an injected feasibility check |