Scoring Map – replicalab/scoring/
Judge scoring engine for protocol evaluation. Pure deterministic functions – no model calls, no side effects.
Tasks implemented: JDG 01, JDG 02, JDG 03, JDG 04, JDG 05, JDG 06, JDG 08
Tasks remaining: JDG 07
Oracle Hybrid Note
The repo now includes an additive Oracle layer for richer scenario generation,
optional Lab Manager narration, optional event injection, and post-mortem
analysis. None of that replaces the files in replicalab/scoring/.
For RL training, this folder remains the canonical reward source:
- deterministic
- reproducible
- testable
- used by the environment for the actual scalar reward signal
Architecture
replicalab/scoring/
__init__.py # exports: score_rigor, score_feasibility, score_fidelity,
# build_reward_breakdown, compute_total_reward
rigor.py            # JDG 01 – protocol structural quality
feasibility.py      # JDG 02 – resource feasibility (wraps AGT 05)
fidelity.py         # JDG 03 – adherence to hidden reference spec
rubric.py           # JDG 04-05 – total reward formula and breakdown builder
explain.py          # JDG 06 – deterministic plain-English explanation
Current Reward Structure
The training signal now has two layers:
- Terminal reward from replicalab/scoring/rubric.py
  - 10 * rigor * feasibility * fidelity * parsimony
  - plus bonuses
  - minus named penalties
- Step shaping reward from replicalab/env/replicalab_env.py
  - information-gain bonus for novel questions
  - protocol-delta and momentum bonuses for productive revisions
  - contradiction, hallucination, stalling, regression, invalid-action, timeout, and no-agreement penalties
The judge remains deterministic. The terminal audit still explains the final
RewardBreakdown, while cumulative episode reward now includes the per-step
shaping applied inside the environment.
Shared Utilities
Token matching extracted into replicalab/utils/text.py:
- normalize_label(label) -> str – lowercase, strip, collapse whitespace
- element_tokens(element) -> list[str] – split into searchable tokens (3+ chars)
Used by: validation.py, rigor.py, fidelity.py
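A minimal sketch of what these two helpers likely look like, assuming straightforward regex-based normalization and tokenization; the exact implementation in replicalab/utils/text.py may differ:

```python
import re

def normalize_label(label: str) -> str:
    """Lowercase, strip, and collapse runs of whitespace (sketch)."""
    return re.sub(r"\s+", " ", label.strip().lower())

def element_tokens(element: str) -> list[str]:
    """Split a normalized label into searchable tokens of 3+ characters (sketch)."""
    return [t for t in re.split(r"\W+", normalize_label(element)) if len(t) >= 3]
```

The 3-character floor keeps short stop-words ("a", "of", "in") from producing spurious matches during token overlap checks.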
JDG 01 – score_rigor(protocol, scenario) -> float
File: rigor.py
Range: [0.0, 1.0]
Measures: structural completeness and alignment to scenario requirements.
Weight Breakdown
| Component | Weight | Method |
|---|---|---|
| Structural completeness | 0.30 | Field population checks |
| Success criteria coverage | 0.40 | Token match vs scenario.success_criteria |
| Required element coverage | 0.30 | Token match vs hidden_reference_spec.required_elements |
Structural Completeness (0.30)
Each of the seven checks contributes equally (0.05 each, raw maximum 0.35, then normalized to [0, 1] before the 0.30 weight is applied):
| Check | Condition |
|---|---|
| Sample size present | sample_size >= 1 |
| Sample size meaningful | sample_size >= 4 |
| Has control | len(controls) >= 1 |
| Has second control | len(controls) >= 2 |
| Technique specified | technique non-empty |
| Duration allocated | duration_days >= 1 |
| Substantive rationale | len(rationale) > 20 |
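The checks above can be sketched as a standalone version of the documented `_structural_completeness(protocol)`; the `Protocol` stand-in below is illustrative, not the real models.py class:

```python
from dataclasses import dataclass, field

@dataclass
class Protocol:  # minimal stand-in for the real models.py Protocol
    sample_size: int = 0
    controls: list = field(default_factory=list)
    technique: str = ""
    duration_days: int = 0
    rationale: str = ""

def structural_completeness(p: Protocol) -> float:
    """Seven equally weighted checks (0.05 each), normalized to [0, 1] (sketch)."""
    checks = [
        p.sample_size >= 1,        # sample size present
        p.sample_size >= 4,        # sample size meaningful
        len(p.controls) >= 1,      # has control
        len(p.controls) >= 2,      # has second control
        bool(p.technique.strip()), # technique specified
        p.duration_days >= 1,      # duration allocated
        len(p.rationale) > 20,     # substantive rationale
    ]
    raw = sum(0.05 for passed in checks if passed)  # raw max 0.35
    return raw / 0.35  # normalize so all checks passing -> 1.0
```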
Internal Functions
| Function | Purpose |
|---|---|
| _structural_completeness(protocol) | Field population score |
| _success_criteria_coverage(protocol, scenario) | Fraction of criteria matched |
| _required_element_coverage(protocol, scenario) | Fraction of elements matched |
| _protocol_text_blob(protocol) | Join all text fields for matching |
| _text_matches(element, blob) | Token overlap check |
JDG 02 – score_feasibility(protocol, scenario, check=None) -> float
File: feasibility.py
Range: [0.0, 1.0]
Measures: whether the lab can execute this protocol.
Key Design: No Rescoring
Does NOT recompute feasibility from scratch. Derives score from FeasibilityCheckResult
produced by AGT 05's check_feasibility(). If no pre-computed check is passed, calls
check_feasibility() internally. This prevents drift between Lab Manager grounding
and Judge scoring.
Weight Breakdown
7 dimensions, each worth 1/7:
| Dimension | Type | Partial Credit Formula |
|---|---|---|
| Protocol | Binary | 1.0 if ok, else 0.0 |
| Budget | Continuous | min(1.0, budget_remaining / estimated_cost) |
| Equipment | Continuous | fraction of required items that are available |
| Reagents | Continuous | fraction of required items that are in stock |
| Schedule | Binary | 1.0 if ok, else 0.0 |
| Staff | Continuous | min(1.0, staff_count / required_staff) |
| Policy | Binary | 1.0 if ok, else 0.0 (hard constraint) |
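The table above implies a simple equal-weight average of the seven dimension scores. A hedged sketch, with a hypothetical flat-argument signature standing in for the real FeasibilityCheckResult-driven code:

```python
def feasibility_score(protocol_ok: bool,
                      budget_remaining: float, estimated_cost: float,
                      equipment_fraction: float, reagent_fraction: float,
                      schedule_ok: bool,
                      staff_count: int, required_staff: int,
                      policy_ok: bool) -> float:
    """Average of 7 dimension scores, each in [0, 1] (sketch)."""
    dims = [
        1.0 if protocol_ok else 0.0,                    # binary
        min(1.0, budget_remaining / estimated_cost),    # continuous
        equipment_fraction,                              # fraction available
        reagent_fraction,                                # fraction in stock
        1.0 if schedule_ok else 0.0,                    # binary
        min(1.0, staff_count / required_staff),         # continuous
        1.0 if policy_ok else 0.0,                      # binary (hard constraint)
    ]
    return sum(dims) / len(dims)
```

Because each dimension is capped at 1.0, a large budget surplus or extra staff cannot compensate for a policy violation.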
Internal Functions
| Function | Purpose |
|---|---|
| _budget_score(check, budget_remaining) | Continuous budget ratio |
| _staff_score(check, staff_count) | Continuous staff ratio |
| _fraction_score(required, available) | Generic item-availability fraction |
JDG 03 – score_fidelity(protocol, scenario) -> float
File: fidelity.py
Range: [0.0, 1.0]
Measures: adherence to hidden_reference_spec (which the scientist never sees).
Weight Breakdown
| Component | Weight | Method |
|---|---|---|
| Required element coverage | 0.50 | Substitution-aware token match |
| Flexible element alignment | 0.20 | Bonus only, no penalty |
| Target metric alignment | 0.20 | Token match vs metric + value |
| Technique appropriateness | 0.10 | Token match vs spec summary |
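Assuming each component score is already in [0, 1], the weighted combination in the table reduces to a one-liner; the function name and flat arguments here are illustrative only:

```python
def fidelity_score(required_cov: float, flexible_cov: float,
                   metric_cov: float, technique_cov: float) -> float:
    """Weighted fidelity combination per the documented table (sketch)."""
    return (0.50 * required_cov    # required element coverage
            + 0.20 * flexible_cov  # flexible element alignment (bonus only)
            + 0.20 * metric_cov    # target metric alignment
            + 0.10 * technique_cov)  # technique appropriateness
```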
Substitution-Aware Scoring
For required elements:
- Direct match (token in protocol text): 1.0 credit
- Substitution match (allowed alternative present): 0.7 credit
- Miss: 0.0 credit
This is the key difference from JDG 01's element check.
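The substitution-aware credit rule can be sketched as follows, mirroring the documented `_required_element_score(elements, text, sub_index)` but using plain substring containment in place of the repo's token matching:

```python
def required_element_score(elements: list[str], text: str,
                           sub_index: dict[str, list[str]]) -> float:
    """Average credit over required elements: direct 1.0, substitution 0.7, miss 0.0 (sketch)."""
    if not elements:
        return 1.0  # nothing required -> full credit (assumed edge-case handling)
    credit = 0.0
    for element in elements:
        if element in text:                                   # direct match
            credit += 1.0
        elif any(alt in text for alt in sub_index.get(element, [])):
            credit += 0.7                                     # allowed substitution
        # else: miss, 0.0 credit
    return credit / len(elements)
```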
Internal Functions
| Function | Purpose |
|---|---|
| _required_element_score(elements, text, sub_index) | Substitution-aware coverage |
| _flexible_element_score(elements, text) | Bonus-only coverage |
| _target_metric_score(metric, value, text) | Metric + value matching |
| _technique_score(summary, text) | Summary alignment |
| _protocol_text_blob(protocol) | Join text fields |
| _text_matches(element, blob) | Token overlap |
| _substitution_matches(element, text, sub_index) | Check alternatives |
| _build_substitution_index(scenario) | Map originals → alternatives |
JDG 04 – compute_total_reward(breakdown) -> float
File: rubric.py
Formula: 10 × rigor × feasibility × fidelity + efficiency_bonus + communication_bonus − sum(penalties)
Returns a scalar reward from a RewardBreakdown object.
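A sketch of the formula as code; the `RewardBreakdown` dataclass below is a minimal stand-in with assumed field names, not the real definition:

```python
from dataclasses import dataclass, field

@dataclass
class RewardBreakdown:  # minimal stand-in for the real breakdown object
    rigor: float
    feasibility: float
    fidelity: float
    efficiency_bonus: float = 0.0
    communication_bonus: float = 0.0
    penalties: dict = field(default_factory=dict)  # name -> magnitude

def compute_total_reward(b: RewardBreakdown) -> float:
    """10 x rigor x feasibility x fidelity + bonuses - named penalties (sketch)."""
    return (10.0 * b.rigor * b.feasibility * b.fidelity
            + b.efficiency_bonus + b.communication_bonus
            - sum(b.penalties.values()))
```

Because the three quality scores multiply, a zero on any axis zeroes the base reward; bonuses and penalties are additive on top.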
JDG 05 – build_reward_breakdown(protocol, scenario, rounds_used, max_rounds, *, check=None) -> RewardBreakdown
File: rubric.py
Composes: rigor (JDG 01) + feasibility (JDG 02) + fidelity (JDG 03) + efficiency bonus.
Efficiency Bonus
- Max bonus: 1.0 (configurable via _MAX_EFFICIENCY_BONUS)
- Formula: bonus × (max_rounds - rounds_used) / (max_rounds - 1)
- Finishing in round 1 of 6 → maximum bonus; using all rounds → 0
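The formula can be sketched as a standalone version of the documented `_efficiency_bonus(rounds_used, max_rounds)`; the clamping and the max_rounds <= 1 guard are assumed edge-case handling:

```python
_MAX_EFFICIENCY_BONUS = 1.0  # documented default

def efficiency_bonus(rounds_used: int, max_rounds: int) -> float:
    """Linear bonus: finishing in round 1 -> full bonus, using every round -> 0 (sketch)."""
    if max_rounds <= 1:
        return 0.0  # degenerate episode length, no bonus (assumption)
    frac = (max_rounds - rounds_used) / (max_rounds - 1)
    return _MAX_EFFICIENCY_BONUS * max(0.0, min(1.0, frac))
```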
Internal Functions
| Function | Purpose |
|---|---|
| compute_total_reward(breakdown) | Apply the reward formula |
| build_reward_breakdown(...) | Compose all sub-scores into a breakdown |
| _efficiency_bonus(rounds_used, max_rounds) | Compute efficiency bonus |
Not Yet Implemented
Bonuses & Penalties – JDG 07
- communication_bonus: reward for clear negotiation (reserved)
- penalties: policy violations, hallucinated resources, etc.
Data Consumed
| Source | Used by | For what |
|---|---|---|
| Protocol (models.py) | All 3 scorers | The final agreed protocol |
| NormalizedScenarioPack (scenarios) | All 3 scorers | Constraints, resources, criteria |
| HiddenReferenceSpec (scenarios) | JDG 01, JDG 03 | Required/flexible elements, target metric |
| FeasibilityCheckResult (agents) | JDG 02 | 7 dimension checks with partial credit |
| AllowedSubstitution (scenarios) | JDG 03 | Partial credit for substitutions |
| element_tokens (utils/text.py) | JDG 01, JDG 03 | Shared token extraction |
Test Coverage – tests/test_reward.py
| Test | What it verifies |
|---|---|
| test_rigor_good_protocol_scores_higher_than_bad | Quality ordering |
| test_rigor_is_deterministic | Same inputs → same output |
| test_rigor_empty_controls_reduces_score | Controls matter |
| test_rigor_short_rationale_reduces_score | Rationale length matters |
| test_rigor_all_domains_return_valid_range | [0,1] across all 9 combinations |
| test_feasibility_viable_protocol_scores_high | Good protocol > 0.7 |
| test_feasibility_infeasible_protocol_scores_lower | Bad < good |
| test_feasibility_accepts_precomputed_check | Pre-computed = computed |
| test_feasibility_is_deterministic | Same inputs → same output |
| test_feasibility_partial_credit_for_near_budget | Slightly over > far over |
| test_feasibility_all_domains_return_valid_range | [0,1] across all 9 combinations |
| test_fidelity_aligned_protocol_scores_higher | Aligned > misaligned |
| test_fidelity_is_deterministic | Same inputs → same output |
| test_fidelity_substitution_gets_partial_credit | Sub > miss |
| test_fidelity_mentioning_target_metric_improves_score | Metric mention helps |
| test_fidelity_all_domains_return_valid_range | [0,1] across all 9 combinations |
| test_all_scores_between_zero_and_one_for_bad_protocol | Bounds check |
| test_good_protocol_dominates_bad_on_rigor_and_fidelity | Cross-scorer consistency |
| test_good_protocol_beats_awful_protocol_on_all_scores_and_total_reward | Good protocol beats a clearly infeasible protocol across all judge axes |
| test_rigor_explicit_success_criteria_mentions_improve_score | Success-criteria mentions improve rigor coverage |
| test_feasibility_partial_equipment_credit_sits_between_full_and_total_miss | Partial equipment availability yields intermediate feasibility credit |
| test_fidelity_direct_match_beats_substitution_and_miss | Fidelity prefers direct match over allowed substitution over a miss |
| test_breakdown_matches_with_and_without_precomputed_feasibility_check | Reward breakdown stays identical with or without an injected feasibility check |