Harmonize the reproducibility-fields allowlist between the signal and the per-row card

by evijit - opened 13 days ago

EvalEval Coalition org 13 days ago

Problem

The Reproducibility signal score (in BenchmarkSignalsStrip) and the per-row
"Reproducibility" panel (in ResearcherReproducibilityCard) historically used
two different field sets:

- Signal score: counts only temperature + max_tokens (plus eval_plan +
  eval_limits for agentic benchmarks). BASE_REQUIRED_FIELDS is restricted to
  these two fields because, in the live EEE corpus, the rest weren't reliably
  populated.
- Per-row card: rendered ~15 fields across Decoding, Sampling, Scoring &
  Uncertainty, Agent setup, plus a Prompt-template block.

This made the two views disagree visually: a row could read "0/15 disclosed"
in the dropdown while the strip above showed a different ratio because it was
only checking 2 fields. As of [this commit] we narrowed the dropdown to the
same allowlist as the signal so they agree, but that hides real metadata that
the producer sometimes does ship (top-p, seed, sample-size, prompt template,
…).

What we want long-term

Either:

1. Expand the signal allowlist as the corpus matures: add fields to
   BASE_REQUIRED_FIELDS once they're populated on, say, ≥80% of recent
   submissions, and re-show those rows in the dropdown.
2. Split the per-row card into a "scored fields" block (counted by the
   signal) and a "context fields" block (informational, not counted), so a
   reader sees the rich metadata when it exists without it changing the
   reproducibility ratio.
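
To make the inclusion rule in option 1 concrete, here is a minimal sketch of the coverage check. The `Submission` shape, the `top_p` / `seed` keys, and both function names are hypothetical, not the actual EEE schema or existing code; the 0.8 threshold is only the "say, ≥80%" figure from above.

```typescript
// Hypothetical shape: one submission's generation config. A field counts as
// "not populated" when it is missing or null. This is not the real EEE schema.
type Submission = Record<string, unknown>;

// Fraction of submissions in which `field` is present and non-null.
function populationRate(submissions: Submission[], field: string): number {
  if (submissions.length === 0) return 0;
  const populated = submissions.filter(
    (s) => s[field] !== undefined && s[field] !== null
  ).length;
  return populated / submissions.length;
}

// Candidate fields whose population rate meets the threshold, i.e. the ones
// that could be promoted into the signal allowlist.
function fieldsReadyForAllowlist(
  submissions: Submission[],
  candidates: string[],
  threshold = 0.8
): string[] {
  return candidates.filter((f) => populationRate(submissions, f) >= threshold);
}

// Made-up sample data for illustration only.
const recent: Submission[] = [
  { temperature: 0, max_tokens: 512, top_p: 1, seed: 7 },
  { temperature: 0.7, max_tokens: 1024, top_p: 0.95 },
  { temperature: 0, max_tokens: 256, top_p: 0.9, seed: null },
  { temperature: 1, max_tokens: 2048, top_p: 1 },
  { temperature: 0.2, max_tokens: 512 },
];

// top_p is populated on 4 of 5 rows, seed on 1 of 5 (null does not count).
console.log(fieldsReadyForAllowlist(recent, ["top_p", "seed"]));
```

What counts as "recent" and whether null should be treated as undisclosed are open choices this sketch does not settle.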

Option 2 is cheaper to ship and probably the right move for now; option 1
becomes the right move once we have data to back the inclusion criteria.
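
A rough sketch of the scored/context split could look like the following. The constant names come from this issue, but their shape here is an assumption (in particular, that AGENTIC_REQUIRED_FIELDS holds only the extra agentic fields), and `RowMetadata`, `isDisclosed`, and `splitCardFields` are hypothetical helpers.

```typescript
// Allowlists as described in this issue. The real constants live in
// components/signals/benchmark-signals-strip.tsx and may be shaped differently.
const BASE_REQUIRED_FIELDS: string[] = ["temperature", "max_tokens"];
const AGENTIC_REQUIRED_FIELDS: string[] = ["eval_plan", "eval_limits"];

// Hypothetical per-row metadata: field name -> whatever the producer shipped.
type RowMetadata = Record<string, unknown>;

interface SplitCard {
  scored: { field: string; disclosed: boolean }[]; // counted by the signal
  context: { field: string; value: unknown }[]; // informational, not counted
}

function isDisclosed(value: unknown): boolean {
  return value !== undefined && value !== null && value !== "";
}

function splitCardFields(row: RowMetadata, isAgentic: boolean): SplitCard {
  const required = isAgentic
    ? BASE_REQUIRED_FIELDS.concat(AGENTIC_REQUIRED_FIELDS)
    : BASE_REQUIRED_FIELDS;
  // Every allowlisted field appears in the scored block, disclosed or not.
  const scored = required.map((field) => ({
    field,
    disclosed: isDisclosed(row[field]),
  }));
  // Anything else the producer shipped is shown as context only.
  const context = Object.keys(row)
    .filter((field) => required.indexOf(field) === -1 && isDisclosed(row[field]))
    .map((field) => ({ field, value: row[field] }));
  return { scored, context };
}

const card = splitCardFields({ temperature: 0, top_p: 0.95, seed: 1234 }, false);
console.log(card.scored); // temperature disclosed, max_tokens not
console.log(card.context); // top_p and seed, shown but not counted
```

Only the `scored` block feeds the disclosure ratio, so showing `top_p` or `seed` in the `context` block cannot change it.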

Files / pointers

- components/signals/benchmark-signals-strip.tsx: BASE_REQUIRED_FIELDS,
  AGENTIC_REQUIRED_FIELDS, deriveReproducibility.
- components/researcher-reproducibility-card.tsx: requiredFieldLabels
  filter, TODO(repro-allowlist) comment.
- Producer side: the EEE schema lists more fields than these two as required;
  see evaleval/EEE_datastore field-completeness reports for which ones are
  populated where.

Acceptance

- The signal score and the per-row dropdown count the same fields and agree
  on the disclosure ratio.
- The richer metadata (prompt template, top-p, seed, sample size, …) is still
  visible to a researcher who expands a row, but rendered in a way that
  doesn't pretend to influence the score.
- Both views update together when the allowlist changes, with no second
  source of truth.
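
One way to check the first and third criteria is to have both views format the output of a single shared function. This is a sketch only: `REPRO_ALLOWLIST`, `requiredFieldsFor`, `countDisclosed`, `stripRatio`, and `cardSummary` are hypothetical names standing in for whatever the strip and the card actually call.

```typescript
// Single shared definition of the allowlist (hypothetical name and shape).
const REPRO_ALLOWLIST = {
  base: ["temperature", "max_tokens"],
  agenticExtra: ["eval_plan", "eval_limits"],
};

type RowMetadata = Record<string, unknown>;

function requiredFieldsFor(isAgentic: boolean): string[] {
  return isAgentic
    ? REPRO_ALLOWLIST.base.concat(REPRO_ALLOWLIST.agenticExtra)
    : REPRO_ALLOWLIST.base.slice();
}

// The only place where disclosure is counted.
function countDisclosed(
  row: RowMetadata,
  isAgentic: boolean
): { disclosed: number; total: number } {
  const fields = requiredFieldsFor(isAgentic);
  const disclosed = fields.filter(
    (f) => row[f] !== undefined && row[f] !== null
  ).length;
  return { disclosed, total: fields.length };
}

// Stand-ins for the two views: each one only formats the shared result.
function stripRatio(row: RowMetadata, isAgentic: boolean): string {
  const r = countDisclosed(row, isAgentic);
  return `${r.disclosed}/${r.total}`;
}

function cardSummary(row: RowMetadata, isAgentic: boolean): string {
  const r = countDisclosed(row, isAgentic);
  return `${r.disclosed}/${r.total} disclosed`;
}

const row: RowMetadata = { temperature: 0.2, eval_plan: "react", top_p: 0.9 };
console.log(stripRatio(row, true)); // "2/4"
console.log(cardSummary(row, true)); // "2/4 disclosed"
```

Because neither view owns a field list, changing `REPRO_ALLOWLIST` moves both ratios at once.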
