Harmonize the reproducibility-fields allowlist between the signal and the per-row card

#5
by evijit - opened
EvalEval Coalition org

Problem

The Reproducibility signal score (in BenchmarkSignalsStrip) and the per-row
"Reproducibility" panel (in ResearcherReproducibilityCard) historically used
two different field sets:

  • Signal score: counts only temperature + max_tokens (plus eval_plan +
    eval_limits for agentic benchmarks). Restricted to two fields per
    BASE_REQUIRED_FIELDS because, in the live EEE corpus, the rest weren't
    reliably populated.
  • Per-row card: rendered ~15 fields across Decoding, Sampling, Scoring &
    Uncertainty, Agent setup, plus a Prompt-template block.

This made the two views disagree visually: a row could read "0/15 disclosed"
in the dropdown while the strip above showed a different ratio because it was
only checking 2 fields. As of [this commit] we narrowed the dropdown to the
same allowlist as the signal so they agree, but that hides real metadata that
the producer sometimes does ship (top-p, seed, sample-size, prompt template,
…).
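
To make the mismatch concrete, here is a minimal sketch of how the same row can produce two different ratios when each view counts against its own field list. The record shape, the CARD_FIELDS list, and the helper names are assumptions for illustration; only BASE_REQUIRED_FIELDS and its two fields come from the description above.

```typescript
// Hypothetical shape of a row's disclosed generation settings; the real EEE
// schema may differ.
type ReproRecord = Record<string, unknown>;

// The two-field allowlist described above (contents as stated in this issue).
const BASE_REQUIRED_FIELDS = ["temperature", "max_tokens"] as const;

// Illustrative subset of the wider set the card used to render (assumed names).
const CARD_FIELDS = ["temperature", "max_tokens", "top_p", "seed", "sample_size"] as const;

// A field counts as disclosed when it is present and neither null nor empty.
// Note that 0 is a legitimate value (e.g. temperature: 0) and must count.
function isDisclosed(record: ReproRecord, field: string): boolean {
  const value = record[field];
  return value !== undefined && value !== null && value !== "";
}

// Formats "disclosed/total" for a given field list.
function disclosureRatio(record: ReproRecord, fields: readonly string[]): string {
  const disclosed = fields.filter((f) => isDisclosed(record, f)).length;
  return `${disclosed}/${fields.length}`;
}

const row: ReproRecord = { temperature: 0, max_tokens: 1024, top_p: null };

const signalRatio = disclosureRatio(row, BASE_REQUIRED_FIELDS); // "2/2"
const cardRatio = disclosureRatio(row, CARD_FIELDS); // "2/5"
```

The row is identical in both calls; only the field list differs, which is why the two views could not agree until they shared one allowlist.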

What we want long-term

Either:

  1. Expand the signal allowlist as the corpus matures: add fields to
    BASE_REQUIRED_FIELDS once they're populated on, say, ≥80% of recent
    submissions, and re-show those rows in the dropdown.
  2. Split the per-row card into a "scored fields" block (counted by the
    signal) and a "context fields" block (informational, not counted), so a
    reader sees the rich metadata when it exists without it changing the
    reproducibility ratio.
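
The inclusion rule in option 1 could look roughly like the sketch below. The 0.8 threshold mirrors the "say, ≥80%" figure above and is not a settled value; the function names, the sample data, and the definition of "populated" are hypothetical.

```typescript
// Hypothetical submission shape; the real EEE records may differ.
type ReproRecord = Record<string, unknown>;

// Assumed definition of "populated": present, non-null, non-empty.
function isPopulated(value: unknown): boolean {
  return value !== undefined && value !== null && value !== "";
}

// Fraction of submissions that populate `field` (0 for an empty corpus).
function populationRate(submissions: readonly ReproRecord[], field: string): number {
  if (submissions.length === 0) return 0;
  const populated = submissions.filter((s) => isPopulated(s[field])).length;
  return populated / submissions.length;
}

// Candidate fields whose population rate meets the threshold.
function promotableFields(
  submissions: readonly ReproRecord[],
  candidates: readonly string[],
  threshold = 0.8,
): string[] {
  return candidates.filter((f) => populationRate(submissions, f) >= threshold);
}

// Made-up sample standing in for "recent submissions".
const recent: ReproRecord[] = [
  { top_p: 0.95, seed: 1 },
  { top_p: 1.0 },
  { top_p: 0.9, seed: 7 },
  { top_p: 0.8 },
  { top_p: 1.0, seed: null },
];

// top_p is populated on 5/5 rows, seed on 2/5, so only top_p qualifies.
const promoted = promotableFields(recent, ["top_p", "seed"]);
```

What counts as "recent" is left open here; the caller would decide which submissions to pass in.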

Option 2 is cheaper to ship and probably the right move for now; option 1
becomes the right move once we have data to back the inclusion criteria.
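
A minimal sketch of the option 2 split, assuming the card receives a flat record of settings. The CardView shape, buildCardView, and SCORED_FIELDS are hypothetical names, not the existing component API.

```typescript
type ReproRecord = Record<string, unknown>;

// Assumed to be the same list the signal counts against.
const SCORED_FIELDS: readonly string[] = ["temperature", "max_tokens"];

interface CardView {
  scored: { field: string; disclosed: boolean }[]; // counted by the signal
  context: { field: string; value: unknown }[]; // informational only
  ratio: string; // derived from the scored block alone
}

function hasValue(value: unknown): boolean {
  return value !== undefined && value !== null && value !== "";
}

function buildCardView(record: ReproRecord, scoredFields: readonly string[]): CardView {
  // Every scored field is listed, disclosed or not, so the denominator is stable.
  const scored = scoredFields.map((field) => ({
    field,
    disclosed: hasValue(record[field]),
  }));

  // Context fields are whatever else the producer shipped; missing ones are
  // simply not shown, so they never register as "undisclosed".
  const scoredSet = new Set(scoredFields);
  const context = Object.entries(record)
    .filter(([field, value]) => !scoredSet.has(field) && hasValue(value))
    .map(([field, value]) => ({ field, value }));

  const disclosedCount = scored.filter((s) => s.disclosed).length;
  return { scored, context, ratio: `${disclosedCount}/${scoredFields.length}` };
}

const view = buildCardView({ temperature: 0.7, top_p: 0.95, seed: 42 }, SCORED_FIELDS);
// view.ratio is "1/2"; top_p and seed land in view.context and do not affect it.
```

Because the ratio is computed only from the scored block, adding or removing context fields cannot move the number the reader sees.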

Files / pointers

  • components/signals/benchmark-signals-strip.tsx: BASE_REQUIRED_FIELDS,
    AGENTIC_REQUIRED_FIELDS, deriveReproducibility.
  • components/researcher-reproducibility-card.tsx: requiredFieldLabels
    filter, TODO(repro-allowlist) comment.
  • Producer side: the EEE schema lists more fields than these two as required;
    see evaleval/EEE_datastore field-completeness reports for which ones are
    populated where.

Acceptance

  • The signal score and the per-row dropdown count the same fields and agree
    on the disclosure ratio.
  • The richer metadata (prompt template, top-p, seed, sample size, …) is still
    visible to a researcher who expands a row, but rendered in a way that
    doesn't pretend to influence the score.
  • Both views update together when the allowlist changes; no second source
    of truth.
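
One way to satisfy the "no second source of truth" criterion is to have both components read the allowlist through a single shared helper. The sketch below reuses the identifier names from the pointers above, but the signatures, the isAgentic flag, and the label functions are assumptions, not the current code.

```typescript
type ReproRecord = Record<string, unknown>;

// Single definition of the allowlists; field contents as described in this issue.
const BASE_REQUIRED_FIELDS = ["temperature", "max_tokens"] as const;
const AGENTIC_REQUIRED_FIELDS = ["eval_plan", "eval_limits"] as const;

// Hypothetical helper: the fields that count for a given benchmark type.
function requiredFields(isAgentic: boolean): readonly string[] {
  return isAgentic
    ? [...BASE_REQUIRED_FIELDS, ...AGENTIC_REQUIRED_FIELDS]
    : BASE_REQUIRED_FIELDS;
}

// Assumed signature; the real deriveReproducibility may take different inputs.
function deriveReproducibility(
  record: ReproRecord,
  isAgentic: boolean,
): { disclosed: number; total: number } {
  const fields = requiredFields(isAgentic);
  const disclosed = fields.filter((f) => {
    const v = record[f];
    return v !== undefined && v !== null && v !== "";
  }).length;
  return { disclosed, total: fields.length };
}

// Both views format the same derived result instead of counting on their own.
function stripLabel(record: ReproRecord, isAgentic: boolean): string {
  const { disclosed, total } = deriveReproducibility(record, isAgentic);
  return `${disclosed}/${total}`;
}

function cardLabel(record: ReproRecord, isAgentic: boolean): string {
  const { disclosed, total } = deriveReproducibility(record, isAgentic);
  return `${disclosed}/${total} disclosed`;
}

const agenticRow: ReproRecord = { temperature: 0.2, max_tokens: 2048, eval_plan: "react" };
const strip = stripLabel(agenticRow, true); // "3/4"
const card = cardLabel(agenticRow, true); // "3/4 disclosed"
```

A unit test asserting that the card label starts with the strip label for a handful of rows would catch any future drift between the two views.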
