Harmonize the reproducibility-fields allowlist between the signal and the per-row card
#5
by evijit - opened
Problem
The Reproducibility signal score (in BenchmarkSignalsStrip) and the per-row
"Reproducibility" panel (in ResearcherReproducibilityCard) historically used
two different field sets:
- Signal score: counts only temperature + max_tokens (plus eval_plan and
eval_limits for agentic benchmarks). Restricted to two fields per
BASE_REQUIRED_FIELDS because, in the live EEE corpus, the rest weren't
reliably populated.
- Per-row card: rendered ~15 fields across Decoding, Sampling, Scoring &
Uncertainty, Agent setup, plus a Prompt-template block.
This made the two views disagree visually: a row could read "0/15 disclosed"
in the dropdown while the strip above showed a different ratio because it was
only checking 2 fields. As of [this commit] we narrowed the dropdown to the
same allowlist as the signal so they agree, but that hides real metadata that
the producer sometimes does ship (top-p, seed, sample-size, prompt template,
…).
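To make the mismatch concrete, here is a minimal sketch of how one row can yield two different ratios when each view counts against its own field list. Everything except temperature and max_tokens is an assumption for illustration: the RowConfig shape, the isDisclosed and disclosureRatio helpers, and the shortened CARD_FIELDS list are hypothetical and not taken from the actual components.

```typescript
// Hypothetical shape: a flat record of whatever the producer shipped.
type RowConfig = Record<string, unknown>;

// A field counts as disclosed when it is present and non-empty.
// Note that 0 is a valid value (e.g. temperature: 0) and counts as disclosed.
function isDisclosed(value: unknown): boolean {
  return value !== undefined && value !== null && value !== "";
}

// Count how many fields of `allowlist` are disclosed in `row`.
function disclosureRatio(
  row: RowConfig,
  allowlist: readonly string[],
): { disclosed: number; total: number } {
  const disclosed = allowlist.filter((f) => isDisclosed(row[f])).length;
  return { disclosed, total: allowlist.length };
}

// What the signal counted, per the description above: two base fields.
const SIGNAL_FIELDS: readonly string[] = ["temperature", "max_tokens"];

// Illustrative stand-in for the card's wider list (the real card had ~15 fields).
const CARD_FIELDS: readonly string[] = [
  "temperature",
  "max_tokens",
  "top_p",
  "seed",
  "sample_size",
];

const sampleRow: RowConfig = { temperature: 0, max_tokens: 1024 };

const signalView = disclosureRatio(sampleRow, SIGNAL_FIELDS); // 2 of 2
const cardView = disclosureRatio(sampleRow, CARD_FIELDS); // 2 of 5
```

The same row reads as fully disclosed in one view and mostly undisclosed in the other, purely because the denominators differ.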
What we want long-term
Either:
1. Expand the signal allowlist as the corpus matures: add fields to
BASE_REQUIRED_FIELDS once they're populated on, say, ≥80% of recent
submissions, and re-show those rows in the dropdown.
2. Split the per-row card into a "scored fields" block (counted by the
signal) and a "context fields" block (informational, not counted), so a
reader sees the rich metadata when it exists without it changing the
reproducibility ratio.
Option 2 is cheaper to ship and is probably the right move for now; option 1
becomes the right move once we have data to back the inclusion criteria.
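The inclusion criterion for option 1 could be sketched as a coverage check over recent submissions. This is an illustration only: the Submission shape, fieldCoverage, and fieldsReadyToPromote are hypothetical names, the sample rows are made up, and the 0.8 threshold is just the "say, ≥80%" figure from above rather than a settled value.

```typescript
// Hypothetical shape: one flat record per recent submission.
type Submission = Record<string, unknown>;

// Fraction of submissions in which `field` is present and non-empty.
function fieldCoverage(subs: readonly Submission[], field: string): number {
  if (subs.length === 0) return 0;
  const populated = subs.filter((s) => {
    const v = s[field];
    return v !== undefined && v !== null && v !== "";
  }).length;
  return populated / subs.length;
}

// Candidate fields whose coverage meets the threshold and could be
// added to the scored allowlist.
function fieldsReadyToPromote(
  subs: readonly Submission[],
  candidates: readonly string[],
  threshold: number = 0.8,
): string[] {
  return candidates.filter((f) => fieldCoverage(subs, f) >= threshold);
}

// Made-up data: top_p appears on 4 of 5 rows, seed on 1 of 5.
const recentSubmissions: Submission[] = [
  { temperature: 0.7, max_tokens: 512, top_p: 0.95, seed: 1 },
  { temperature: 0.0, max_tokens: 512, top_p: 1.0 },
  { temperature: 0.2, max_tokens: 256, top_p: 0.9 },
  { temperature: 1.0, max_tokens: 256, top_p: 0.9 },
  { temperature: 0.7, max_tokens: 128 },
];

const promoted = fieldsReadyToPromote(recentSubmissions, ["top_p", "seed"]);
// promoted contains only "top_p"
```

Running a check like this against the field-completeness reports would give the data needed to justify each addition.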
Files / pointers
- components/signals/benchmark-signals-strip.tsx: BASE_REQUIRED_FIELDS,
AGENTIC_REQUIRED_FIELDS, deriveReproducibility.
- components/researcher-reproducibility-card.tsx: requiredFieldLabels
filter, TODO(repro-allowlist) comment.
- Producer side: the EEE schema lists more fields than these two as required;
see evaleval/EEE_datastore field-completeness reports for which ones are
populated where.
Acceptance
- The signal score and the per-row dropdown count the same fields and agree
on the disclosure ratio.
- The richer metadata (prompt template, top-p, seed, sample size, …) is still
visible to a researcher who expands a row, but rendered in a way that
doesn't pretend to influence the score.
- Both views update together when the allowlist changes, with no second
source of truth.
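One way to meet all three criteria is to have both views derive from a single allowlist and let the card place everything else it finds into an uncounted block. A minimal sketch follows, assuming a flat config record; SCORED_FIELDS stands in for BASE_REQUIRED_FIELDS, and CardRow, hasValue, scoredRatio, and splitCardFields are hypothetical helpers, not existing code in either component.

```typescript
// Hypothetical shape: a flat record of the row's generation settings.
type CardRow = Record<string, unknown>;

// Single source of truth; stand-in for BASE_REQUIRED_FIELDS.
const SCORED_FIELDS: readonly string[] = ["temperature", "max_tokens"];

function hasValue(v: unknown): boolean {
  return v !== undefined && v !== null && v !== "";
}

// Ratio used by both the strip and the card header.
function scoredRatio(
  row: CardRow,
  scored: readonly string[] = SCORED_FIELDS,
): { disclosed: number; total: number } {
  const disclosed = scored.filter((f) => hasValue(row[f])).length;
  return { disclosed, total: scored.length };
}

// Partition for the card: scored fields are always listed (so gaps stay
// visible); context fields are listed only when the producer shipped them.
function splitCardFields(
  row: CardRow,
  scored: readonly string[] = SCORED_FIELDS,
): { scored: string[]; context: string[] } {
  const context = Object.keys(row).filter(
    (f) => scored.indexOf(f) === -1 && hasValue(row[f]),
  );
  return { scored: scored.slice(), context };
}

const expandedRow: CardRow = { temperature: 0.7, top_p: 0.95, seed: 42 };

const ratio = scoredRatio(expandedRow); // 1 of 2: max_tokens is missing
const blocks = splitCardFields(expandedRow); // context: top_p, seed
```

Because the ratio and the partition read the same list, adding a field to the allowlist moves it from the context block to the scored block and changes the denominator in both views at once.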