
Scoring Map — replicalab/scoring/

Judge scoring engine for protocol evaluation. Pure deterministic functions — no model calls, no side effects.

Tasks implemented: JDG 01, JDG 02, JDG 03, JDG 04, JDG 05, JDG 06, JDG 08
Tasks remaining: JDG 07

Oracle Hybrid Note

The repo now includes an additive Oracle layer for richer scenario generation, optional Lab Manager narration, optional event injection, and post-mortem analysis. None of that replaces the files in replicalab/scoring/.

For RL training, this folder remains the canonical reward source:

  • deterministic
  • reproducible
  • testable
  • used by the environment for the actual scalar reward signal

Architecture

replicalab/scoring/
    __init__.py          # exports: score_rigor, score_feasibility, score_fidelity,
                         #          build_reward_breakdown, compute_total_reward
    rigor.py             # JDG 01 — protocol structural quality
    feasibility.py       # JDG 02 — resource feasibility (wraps AGT 05)
    fidelity.py          # JDG 03 — adherence to hidden reference spec
    rubric.py            # JDG 04-05 — total reward formula and breakdown builder
    explain.py           # JDG 06 — deterministic plain-English explanation

Current Reward Structure

The training signal now has two layers:

  • Terminal reward from replicalab/scoring/rubric.py
    • 10 * rigor * feasibility * fidelity * parsimony
    • plus bonuses
    • minus named penalties
  • Step shaping reward from replicalab/env/replicalab_env.py
    • information-gain bonus for novel questions
    • protocol-delta and momentum bonuses for productive revisions
    • contradiction, hallucination, stalling, regression, invalid-action, timeout, and no-agreement penalties

The judge remains deterministic. The terminal audit still explains the final RewardBreakdown, while the cumulative episode reward now includes the per-step shaping applied inside the environment.
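Put together, the return the trainer sees for an episode is just the accumulated step-shaping terms plus the terminal judge reward. A minimal sketch (the function name and list-of-floats shape are illustrative, not the env's actual API):

```python
def episode_return(step_shaping: list[float], terminal_reward: float) -> float:
    """Cumulative episode reward: per-step shaping terms plus the terminal judge reward."""
    return sum(step_shaping) + terminal_reward
```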

Shared Utilities

Token matching extracted into replicalab/utils/text.py:

  • normalize_label(label) -> str β€” lowercase, strip, collapse whitespace
  • element_tokens(element) -> list[str] β€” split into searchable tokens (3+ chars)

Used by: validation.py, rigor.py, fidelity.py
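A plausible implementation of these two helpers (signatures come from the list above; the bodies are a sketch, not the actual replicalab/utils/text.py source):

```python
import re


def normalize_label(label: str) -> str:
    """Lowercase, strip, and collapse runs of whitespace to a single space."""
    return re.sub(r"\s+", " ", label.strip().lower())


def element_tokens(element: str) -> list[str]:
    """Split a normalized element into searchable tokens of 3+ characters."""
    return [tok for tok in re.split(r"\W+", normalize_label(element)) if len(tok) >= 3]
```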


JDG 01 — score_rigor(protocol, scenario) -> float

File: rigor.py
Range: [0.0, 1.0]
Measures: structural completeness and alignment to scenario requirements.

Weight Breakdown

| Component | Weight | Method |
| --- | --- | --- |
| Structural completeness | 0.30 | Field population checks |
| Success criteria coverage | 0.40 | Token match vs scenario.success_criteria |
| Required element coverage | 0.30 | Token match vs hidden_reference_spec.required_elements |

Structural Completeness (0.30)

Each check contributes equally (0.05 each, total 0.35, then normalized):

| Check | Condition |
| --- | --- |
| Sample size present | sample_size >= 1 |
| Sample size meaningful | sample_size >= 4 |
| Has control | len(controls) >= 1 |
| Has second control | len(controls) >= 2 |
| Technique specified | technique non-empty |
| Duration allocated | duration_days >= 1 |
| Substantive rationale | len(rationale) > 20 |

Internal Functions

| Function | Purpose |
| --- | --- |
| _structural_completeness(protocol) | Field population score |
| _success_criteria_coverage(protocol, scenario) | Fraction of criteria matched |
| _required_element_coverage(protocol, scenario) | Fraction of elements matched |
| _protocol_text_blob(protocol) | Join all text fields for matching |
| _text_matches(element, blob) | Token overlap check |

JDG 02 — score_feasibility(protocol, scenario, check=None) -> float

File: feasibility.py
Range: [0.0, 1.0]
Measures: whether the lab can execute this protocol.

Key Design: No Rescoring

Does NOT recompute feasibility from scratch. Derives score from FeasibilityCheckResult produced by AGT 05's check_feasibility(). If no pre-computed check is passed, calls check_feasibility() internally. This prevents drift between Lab Manager grounding and Judge scoring.

Weight Breakdown

7 dimensions, each worth 1/7:

| Dimension | Type | Partial Credit Formula |
| --- | --- | --- |
| Protocol | Binary | 1.0 if ok, else 0.0 |
| Budget | Continuous | min(1.0, budget_remaining / estimated_cost) |
| Equipment | Continuous | fraction of required items that are available |
| Reagents | Continuous | fraction of required items that are in stock |
| Schedule | Binary | 1.0 if ok, else 0.0 |
| Staff | Continuous | min(1.0, staff_count / required_staff) |
| Policy | Binary | 1.0 if ok, else 0.0 (hard constraint) |

Internal Functions

| Function | Purpose |
| --- | --- |
| _budget_score(check, budget_remaining) | Continuous budget ratio |
| _staff_score(check, staff_count) | Continuous staff ratio |
| _fraction_score(required, available) | Generic item-availability fraction |

JDG 03 — score_fidelity(protocol, scenario) -> float

File: fidelity.py
Range: [0.0, 1.0]
Measures: adherence to hidden_reference_spec (which the scientist never sees).

Weight Breakdown

| Component | Weight | Method |
| --- | --- | --- |
| Required element coverage | 0.50 | Substitution-aware token match |
| Flexible element alignment | 0.20 | Bonus only, no penalty |
| Target metric alignment | 0.20 | Token match vs metric + value |
| Technique appropriateness | 0.10 | Token match vs spec summary |

Substitution-Aware Scoring

For required elements:

  • Direct match (token in protocol text): 1.0 credit
  • Substitution match (allowed alternative present): 0.7 credit
  • Miss: 0.0 credit

This is the key difference from JDG 01's element check.

Internal Functions

| Function | Purpose |
| --- | --- |
| _required_element_score(elements, text, sub_index) | Substitution-aware coverage |
| _flexible_element_score(elements, text) | Bonus-only coverage |
| _target_metric_score(metric, value, text) | Metric + value matching |
| _technique_score(summary, text) | Summary alignment |
| _protocol_text_blob(protocol) | Join text fields |
| _text_matches(element, blob) | Token overlap |
| _substitution_matches(element, text, sub_index) | Check alternatives |
| _build_substitution_index(scenario) | Map originals → alternatives |


JDG 04 — compute_total_reward(breakdown) -> float

File: rubric.py
Formula: 10 × rigor × feasibility × fidelity + efficiency_bonus + communication_bonus − sum(penalties)

Returns a scalar reward from a RewardBreakdown object.
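The formula translates directly to code; here the RewardBreakdown is stood in by a plain dict (the real object is a model class with named fields):

```python
def compute_total_reward(b: dict) -> float:
    """10 x rigor x feasibility x fidelity, plus bonuses, minus named penalties."""
    return (
        10.0 * b["rigor"] * b["feasibility"] * b["fidelity"]
        + b.get("efficiency_bonus", 0.0)
        + b.get("communication_bonus", 0.0)
        - sum(b.get("penalties", {}).values())
    )
```

The multiplicative core means a zero on any one axis zeroes the product: a perfectly rigorous but infeasible protocol earns no base reward.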

JDG 05 — build_reward_breakdown(protocol, scenario, rounds_used, max_rounds, *, check=None) -> RewardBreakdown

File: rubric.py
Composes: rigor (JDG 01) + feasibility (JDG 02) + fidelity (JDG 03) + efficiency bonus.

Efficiency Bonus

  • Max bonus: 1.0 (configurable via _MAX_EFFICIENCY_BONUS)
  • Formula: bonus Γ— (max_rounds - rounds_used) / (max_rounds - 1)
  • Finishing in round 1 of 6 β†’ maximum bonus; using all rounds β†’ 0

Internal Functions

| Function | Purpose |
| --- | --- |
| compute_total_reward(breakdown) | Apply the reward formula |
| build_reward_breakdown(...) | Compose all sub-scores into a breakdown |
| _efficiency_bonus(rounds_used, max_rounds) | Compute efficiency bonus |

Not Yet Implemented

Bonuses & Penalties — JDG 07

  • communication_bonus: reward for clear negotiation (reserved)
  • penalties: policy violations, hallucinated resources, etc.

Data Consumed

| Source | Used by | For what |
| --- | --- | --- |
| Protocol (models.py) | All 3 scorers | The final agreed protocol |
| NormalizedScenarioPack (scenarios) | All 3 scorers | Constraints, resources, criteria |
| HiddenReferenceSpec (scenarios) | JDG 01, JDG 03 | Required/flexible elements, target metric |
| FeasibilityCheckResult (agents) | JDG 02 | 7 dimension checks with partial credit |
| AllowedSubstitution (scenarios) | JDG 03 | Partial credit for substitutions |
| element_tokens (utils/text.py) | JDG 01, JDG 03 | Shared token extraction |

Test Coverage — tests/test_reward.py

| Test | What it verifies |
| --- | --- |
| test_rigor_good_protocol_scores_higher_than_bad | Quality ordering |
| test_rigor_is_deterministic | Same inputs → same output |
| test_rigor_empty_controls_reduces_score | Controls matter |
| test_rigor_short_rationale_reduces_score | Rationale length matters |
| test_rigor_all_domains_return_valid_range | [0,1] across all 9 combinations |
| test_feasibility_viable_protocol_scores_high | Good protocol > 0.7 |
| test_feasibility_infeasible_protocol_scores_lower | Bad < good |
| test_feasibility_accepts_precomputed_check | Pre-computed = computed |
| test_feasibility_is_deterministic | Same inputs → same output |
| test_feasibility_partial_credit_for_near_budget | Slightly over > far over |
| test_feasibility_all_domains_return_valid_range | [0,1] across all 9 combinations |
| test_fidelity_aligned_protocol_scores_higher | Aligned > misaligned |
| test_fidelity_is_deterministic | Same inputs → same output |
| test_fidelity_substitution_gets_partial_credit | Sub > miss |
| test_fidelity_mentioning_target_metric_improves_score | Metric mention helps |
| test_fidelity_all_domains_return_valid_range | [0,1] across all 9 combinations |
| test_all_scores_between_zero_and_one_for_bad_protocol | Bounds check |
| test_good_protocol_dominates_bad_on_rigor_and_fidelity | Cross-scorer consistency |
| test_good_protocol_beats_awful_protocol_on_all_scores_and_total_reward | Good protocol beats a clearly infeasible protocol across all judge axes |
| test_rigor_explicit_success_criteria_mentions_improve_score | Success-criteria mentions improve rigor coverage |
| test_feasibility_partial_equipment_credit_sits_between_full_and_total_miss | Partial equipment availability yields intermediate feasibility credit |
| test_fidelity_direct_match_beats_substitution_and_miss | Fidelity prefers direct match over allowed substitution over a miss |
| test_breakdown_matches_with_and_without_precomputed_feasibility_check | Reward breakdown stays identical with or without an injected feasibility check |