
EvalCards interpretive signals: frontend implementation spec

Status: ready to implement. Backend ships in evaleval/eval_cards_backend_pipeline PR #1 (merged b05323c). All field shapes below are stable and covered by the backend's test suite.

Companion docs:

  • Spec source of truth: EvalCards Interpretive Signals v1.0 (Anka Reuel, Stanford). Section refs (§3, §4, …) below point at that doc.
  • Open backend questions: evaleval/eval_cards_backend_pipeline#2. None block frontend work; they may shift wording, not shape.

0. What this PR does at a glance

The backend now annotates evaluation records with four interpretive signals:

  1. Reproducibility gap, per row. Was the evaluation documented well enough to be re-run? Surfaced as a missing-fields list (e.g. "missing max_tokens").
  2. Reporting completeness, per benchmark. What fraction of EvalCards-required documentation fields are populated? Surfaced as a [0, 1] score with a missing-field breakdown.
  3. Provenance, per row. Who reported this score (first-party / third-party / collaborative / unspecified), and is it the only source for this (model, benchmark, metric) group?
  4. Comparability, per (model, benchmark, metric) group. Two flavors: variant divergence (same model, same benchmark, different setups → diverging scores) and cross-party divergence (different orgs reporting → diverging scores).

Plus a corpus-level rollup file (corpus-aggregates.json) for a stratified analytics page.

The frontend's job is to surface these signals in three places: row-level badges, per-eval / per-model summary panels, and a corpus dashboard view.


1. Where the new data lives

All fields are new additions to existing artifacts. No artifact is removed or reshaped.

| Artifact | New fields |
| --- | --- |
| evals/{id}.json (HFEvalDetail) | Per-row evalcards.annotations block on every metrics[].model_results[] and subtasks[…].metrics[].model_results[]. Plus eval-root evalcards.annotations.reporting_completeness, evalcards.annotations.benchmark_comparability, and three top-level summaries: reproducibility_summary, provenance_summary, comparability_summary. |
| models/{id}.json (HFModelDetail) | Per-row evalcards.annotations block on every hierarchy_by_category[*][*].metrics[].model_results[]. Plus three top-level summaries scoped to that model. |
| eval-list.json / eval-list-lite.json (HFEvalListEntry) | Three summaries per entry. |
| model-cards.json / model-cards-lite.json (HFModelCardEntry) | Three summaries per entry. |
| eval-hierarchy.json (EvalHierarchy) | Each family node and leaf node carries the three summaries (aggregated over evals under it). |
| corpus-aggregates.json (NEW FILE) | Stratified rollups for paper / dashboard use. |
| manifest.json | New entry in summary_artifacts: corpus_aggregates: "corpus-aggregates.json". |

signal_version (currently "1.0") is present on every annotation. Treat it as opaque; surface only in admin/debug.


2. TypeScript types to add

Add to lib/backend-artifacts.ts (preferred, since these are pipeline contract types):

// Spec §3
export interface ReproducibilityGap {
  has_reproducibility_gap: boolean
  missing_fields: string[]              // e.g. ["max_tokens"]
  required_field_count: number          // 2 base + 2 if agentic on current runtime
  populated_field_count: number
  signal_version: string
}

// Spec §5
export type ProvenanceSourceType =
  | "first_party"
  | "third_party"
  | "collaborative"
  | "unspecified"

export interface Provenance {
  source_type: ProvenanceSourceType
  is_multi_source: boolean
  first_party_only: boolean             // see §6.1 below for caveat
  distinct_reporting_organizations: number
  signal_version: string
}

// Spec §6.1
export interface VariantDivergence {
  has_variant_divergence: boolean
  group_id: string                      // "{model_route_id}__{metric_summary_id}"
  divergence_magnitude: number
  threshold_used: number
  threshold_basis:
    | "proportion_or_continuous_normalized"
    | "percent"
    | "range_5pct"
    | "fallback_default"
  differing_setup_fields: Array<{ field: string; values: unknown[] }>
  scores_in_group: number[]
  this_triple_score: number | null      // this row's score within the group
  triple_count_in_group: number
  score_scale_anomaly: boolean
  group_variant_breakdown: Array<{ variant_key: string; row_count: number }>
  signal_version: string
}

// Spec §6.2
export interface CrossPartyDivergence {
  has_cross_party_divergence: boolean
  group_id: string
  divergence_magnitude: number
  threshold_used: number
  threshold_basis: VariantDivergence["threshold_basis"]
  scores_by_organization: Record<string, number>   // display org name → score
  differing_setup_fields: Array<{ field: string; values: unknown[] }>
  organization_count: number
  group_variant_breakdown: Array<{ variant_key: string; row_count: number }>
  signal_version: string
}

// Per-row annotation block (carried on every model_result row)
export interface RowAnnotations {
  reproducibility_gap: ReproducibilityGap | null
  provenance: Provenance | null
  variant_divergence: VariantDivergence | null
  cross_party_divergence: CrossPartyDivergence | null
}

// Spec §4
export interface ReportingCompleteness {
  completeness_score: number            // [0, 1]
  total_fields_evaluated: number
  missing_required_fields: string[]     // dotted paths
  partial_fields: Array<{
    field_path: string
    score: number                       // (0, 1), strictly between
    populated_subitems: number
    total_subitems: number
  }>
  field_scores: Array<{
    field_path: string
    coverage_type: "full" | "partial" | "reserved"
    score: number                       // [0, 1]
  }>
  signal_version: string
}

export interface BenchmarkComparability {
  variant_divergence_groups: Array<{
    group_id: string
    model_route_id: string
    divergence_magnitude: number
    threshold_used: number
    threshold_basis: VariantDivergence["threshold_basis"]
    differing_setup_fields: VariantDivergence["differing_setup_fields"]
  }>
  cross_party_divergence_groups: Array<{
    group_id: string
    model_route_id: string
    divergence_magnitude: number
    threshold_used: number
    threshold_basis: VariantDivergence["threshold_basis"]
    scores_by_organization: Record<string, number>
    differing_setup_fields: VariantDivergence["differing_setup_fields"]
  }>
}

// Eval-root or model-root annotation block
export interface EvalcardsAnnotations {
  reporting_completeness?: ReportingCompleteness
  benchmark_comparability?: BenchmarkComparability
}

// Top-level summary blocks (present on eval-list / model-cards / eval / model / hierarchy nodes)
export interface ReproducibilitySummary {
  results_total: number
  has_reproducibility_gap_count: number
  populated_ratio_avg: number | null    // null when results_total == 0
}

export interface ProvenanceSummary {
  total_results: number
  total_groups: number
  multi_source_groups: number
  first_party_only_groups: number
  source_type_distribution: Record<ProvenanceSourceType, number>
}

export interface ComparabilitySummary {
  total_groups: number
  groups_with_variant_check: number     // eligible groups (>=2 rows, differing setups, >=2 scored)
  groups_with_cross_party_check: number // eligible groups (>=2 named orgs)
  variant_divergent_count: number
  cross_party_divergent_count: number
}

export interface SignalSummaries {
  reproducibility_summary?: ReproducibilitySummary
  provenance_summary?: ProvenanceSummary
  comparability_summary?: ComparabilitySummary
}

// corpus-aggregates.json
export interface CorpusAggregates {
  generated_at: string
  signal_version: string
  stratification_dimensions: ["category"]
  reproducibility: Stratified<ReproducibilityCorpusBlock>
  completeness:   Stratified<CompletenessCorpusBlock>
  provenance:     Stratified<ProvenanceCorpusBlock>
  comparability:  Stratified<ComparabilityCorpusBlock>
}

export interface Stratified<T> {
  overall: T
  by_category: Record<string, T>        // categories: agentic | general | knowledge | reasoning | safety | other
}

export interface ReproducibilityCorpusBlock {
  total_triples: number
  triples_with_reproducibility_gap: number
  reproducibility_gap_rate: number | null
  agentic_triples: number
  per_field_missingness: Record<string, {
    missing_count: number
    missing_rate: number | null
    denominator: "all_triples" | "agentic_only"
    denominator_count: number
  }>
}

export interface CompletenessCorpusBlock {
  total_benchmarks: number
  completeness_score_mean: number | null
  completeness_score_median: number | null
  per_field_population: Record<string, {
    mean_score: number
    populated_rate: number
    fully_populated_rate: number
    benchmark_count: number
  }>
}

export interface ProvenanceCorpusBlock {
  total_triples: number
  total_groups: number
  multi_source_groups: number
  multi_source_rate: number | null
  first_party_only_groups: number
  first_party_only_rate: number | null
  source_type_distribution: Record<ProvenanceSourceType, number>
}

export interface ComparabilityCorpusBlock {
  total_groups: number
  variant_eligible_groups: number
  variant_divergent_groups: number
  variant_divergence_rate: number | null
  cross_party_eligible_groups: number
  cross_party_divergent_groups: number
  cross_party_divergence_rate: number | null   // commonly null on current corpus
}

Then in lib/hf-data.ts:

  • Extend HFEvalModelResult (line ~522) with evalcards?: { annotations?: RowAnnotations }.
  • Extend HFEvalDetail (line ~556) with evalcards?: { annotations?: EvalcardsAnnotations } plus the three summary fields from SignalSummaries.
  • Extend HFEvalListEntry (line ~475) with SignalSummaries fields.
  • Extend HFModelCardEntry (line ~439) with SignalSummaries fields.
  • Extend HFModelDetail (line ~571) with SignalSummaries fields.
  • Extend HFModelHierarchyMetric (line ~616): model_results is already typed as HFEvalModelResult, so the per-row annotations propagate automatically.

In EvalHierarchy types (lib/backend-artifacts.ts line ~54), add SignalSummaries to both HierarchyFamily and HierarchyBenchmark.

All fields are optional at the type level: older cached snapshots won't have them, and the frontend should render gracefully when they're absent.


3. Data plumbing

3.1 New fetcher + API route for corpus aggregates

In lib/hf-data.ts, add after the existing fetchers (~line 866):

export async function fetchCorpusAggregates(): Promise<CorpusAggregates | null> {
  return fetchHFJsonSafe<CorpusAggregates>("corpus-aggregates.json")
}

Add to scripts/cache-hf-data.mjs CACHE_ROOT_FILES array: "corpus-aggregates.json". (Mark it optional in OPTIONAL_CACHE_ROOT_FILES if shipping while the HF dataset upload is still rolling; once the backend pipeline next runs against the dataset, the file will appear.)

Create app/api/corpus-aggregates/route.ts:

import { NextResponse } from "next/server"
import { fetchCorpusAggregates } from "@/lib/hf-data"

export async function GET() {
  const aggregates = await fetchCorpusAggregates()
  if (!aggregates) {
    return NextResponse.json({ error: "Corpus aggregates not available" }, { status: 404 })
  }
  return NextResponse.json(aggregates)
}

3.2 Rest of plumbing is automatic

Existing fetchers (fetchEvalDetail, fetchModelDetail, fetchEvalList, fetchModelCardsList, fetchEvalHierarchy) just pull the raw JSON, so the new fields propagate without code changes once the types above are widened.


4. UX components to build

Build a small set of reusable signal components in components/signals/. Each takes one of the typed shapes above and renders a badge / panel. This keeps signal rendering consistent across eval-detail.tsx, benchmark-detail.tsx, model-compare-dialog.tsx, and the new corpus dashboard.

components/signals/
├── reproducibility-badge.tsx
├── provenance-badge.tsx          // already partially exists in benchmark-detail.tsx (see §4.2)
├── variant-divergence-badge.tsx
├── cross-party-divergence-badge.tsx
├── reproducibility-panel.tsx      // detail view: full missing-fields list
├── completeness-panel.tsx         // detail view: score bar + missing-field list
├── comparability-panel.tsx        // detail view: divergent groups list
├── signals-row-badges.tsx         // composite: renders all four row-level badges with proper spacing
└── signal-tooltip.tsx             // shared tooltip primitive

All badges should follow the existing tone conventions used by getRelationshipBadgeTone (components/benchmark-detail.tsx:289) and the Badge primitive in components/ui/badge.tsx.

4.1 Row-level badges: placement

Insert <SignalsRowBadges annotations={modelResult.evalcards?.annotations} /> next to the score cell in:

  • Eval detail leaderboard table: components/eval-detail.tsx:869-871 (the <TableCell className="text-right"> containing the score). Render badges below the score on a new line for desktop, hidden on mobile.
  • Benchmark detail rows: components/benchmark-detail.tsx renders score rows in several places (search for formatRawScoreValue); insert the same component.
  • Model compare dialog: components/model-compare-dialog.tsx score columns.

Display rules: only badge for actionable states. Silence is meaningful here.

| Signal | Show badge when | Hide when |
| --- | --- | --- |
| Reproducibility | has_reproducibility_gap === true | gap=false, or annotation absent |
| Provenance | source_type ∈ {first_party, third_party, collaborative} | source_type === "unspecified" |
| Variant divergence | variant_divergence !== null && has_variant_divergence === true | null (not applicable) or false (checked, fine) |
| Cross-party divergence | cross_party_divergence !== null && has_cross_party_divergence === true | null (almost always on current corpus) or false |

has_*: false means "we checked and it's fine" (silent success). null means "not applicable / not enough data" (also silent). Only divergent / gap-positive states warrant pixels.
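The display rules above can be sketched as a single pure function. This is a minimal illustration, not the component itself: visibleBadges, RowAnnotationsLike, and BadgeKind are hypothetical names, and the local interface copies only the fields the rules read.

```typescript
type SourceType = "first_party" | "third_party" | "collaborative" | "unspecified"

// Minimal local copy of the row-level shapes (assumption: only these fields matter here).
interface RowAnnotationsLike {
  reproducibility_gap?: { has_reproducibility_gap: boolean } | null
  provenance?: { source_type: SourceType } | null
  variant_divergence?: { has_variant_divergence: boolean } | null
  cross_party_divergence?: { has_cross_party_divergence: boolean } | null
}

type BadgeKind =
  | "reproducibility"
  | "provenance"
  | "variant_divergence"
  | "cross_party_divergence"

// Hypothetical helper: returns the badges to draw for one row.
// Absent, null, and false states all stay silent.
function visibleBadges(a: RowAnnotationsLike | null | undefined): BadgeKind[] {
  const out: BadgeKind[] = []
  if (a?.reproducibility_gap?.has_reproducibility_gap === true) out.push("reproducibility")
  const source = a?.provenance?.source_type
  if (source !== undefined && source !== "unspecified") out.push("provenance")
  if (a?.variant_divergence?.has_variant_divergence === true) out.push("variant_divergence")
  if (a?.cross_party_divergence?.has_cross_party_divergence === true) out.push("cross_party_divergence")
  return out
}
```

Keeping the decision in one function means the composite badge component and any tests agree on what "silent" means.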

Dedup rule. variant_divergence and cross_party_divergence are duplicated onto every row in the same group. If you render three rows from the same group_id, render the divergence badge on each row but the expanded panel (§4.4) only once at the group header.
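One way to apply the dedup rule is to bucket rows by group_id before rendering, then draw one panel per bucket. The sketch below assumes a hypothetical DivergenceRow shape and helper name; how group headers are actually laid out is left to the component.

```typescript
// Hypothetical row shape: only the two fields the grouping needs.
interface DivergenceRow {
  rowId: string
  groupId: string | null // e.g. annotations.variant_divergence?.group_id ?? null
}

// Buckets row ids by group_id, preserving the first-seen order of groups.
// Rows whose annotation is null have no group and are skipped.
function groupRowsForPanels(rows: DivergenceRow[]): Array<{ groupId: string; rowIds: string[] }> {
  const byGroup: Record<string, string[]> = Object.create(null)
  const order: string[] = []
  for (const row of rows) {
    if (row.groupId === null) continue
    let bucket = byGroup[row.groupId]
    if (bucket === undefined) {
      bucket = []
      byGroup[row.groupId] = bucket
      order.push(row.groupId)
    }
    bucket.push(row.rowId)
  }
  return order.map((groupId) => ({ groupId, rowIds: byGroup[groupId] }))
}
```

Each returned entry corresponds to one expanded panel; the badge is still drawn on every row id listed in rowIds.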

4.2 Provenance badge: reuse what's there

components/benchmark-detail.tsx:262-302 already has getRelationshipShortLabel and getRelationshipBadgeTone. Extract these into components/signals/provenance-badge.tsx and import back into benchmark-detail.tsx. The new badge should also consume the new Provenance annotation when present (it carries is_multi_source and first_party_only, which the current implementation derives row-by-row from source_metadata alone).

When provenance.first_party_only === true, show a small, subtle ⚠ indicator on the first-party badge ("first-party only: no independent replication"). This is the headline use of the signal for policy-mode readers.

4.3 Reproducibility badge: content rules

Tooltip content depends on audience mode (useAudienceMode() from components/audience-mode-provider.tsx:40):

  • Research mode: "Setup not fully documented. Missing: max_tokens, eval_plan."
  • Policy mode: "This score's setup isn't fully documented, so it can't be re-run as-is."

Always include the count "{populated_field_count} of {required_field_count} setup fields recorded." Don't hardcode "4 fields": the active runtime checks 2 base fields (temperature, max_tokens) plus 2 agentic fields (eval_plan, eval_limits) when the benchmark is agentic. Read counts off the annotation.
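A minimal sketch of the tooltip text under these rules. The AudienceMode values and the function name are assumptions for illustration; the real mode comes from useAudienceMode(), whose return type is not shown in this doc.

```typescript
type AudienceMode = "research" | "policy" // assumed values

interface GapLike {
  missing_fields: string[]
  required_field_count: number
  populated_field_count: number
}

// Hypothetical helper: builds the tooltip string; counts always come from the annotation.
function reproducibilityTooltip(gap: GapLike, mode: AudienceMode): string {
  const lead =
    mode === "research"
      ? `Setup not fully documented. Missing: ${gap.missing_fields.join(", ")}.`
      : "This score's setup isn't fully documented, so it can't be re-run as-is."
  const count = `${gap.populated_field_count} of ${gap.required_field_count} setup fields recorded.`
  return `${lead} ${count}`
}
```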

4.4 Detail panels: placement

Reproducibility panel

The existing "Evaluation Provenance" panel in components/eval-detail.tsx:952-998 (rendered when a row is expanded) is the right place for the per-row reproducibility breakdown. Add a new DetailPanel adjacent to it:

{rowAnnotations?.reproducibility_gap && (
  <DetailPanel
    title={isResearchView ? "Reproducibility" : "Re-runnability"}
    subtitle={
      isResearchView
        ? "Whether the setup is documented well enough for someone else to re-run."
        : "Whether someone could re-run this evaluation with the information available."
    }
  >
    <MetaRow
      label="Setup fields recorded"
      value={`${rowAnnotations.reproducibility_gap.populated_field_count} of ${rowAnnotations.reproducibility_gap.required_field_count}`}
    />
    {rowAnnotations.reproducibility_gap.missing_fields.length > 0 && (
      <MetaRow
        label="Missing"
        value={rowAnnotations.reproducibility_gap.missing_fields.join(", ")}
      />
    )}
  </DetailPanel>
)}

Completeness panel

Render at the eval-detail header level (above the leaderboard, below the metric specification card). New <CompletenessPanel completeness={detail.evalcards?.annotations?.reporting_completeness} />. UI: progress bar showing completeness_score, label "{N} of {M} fields populated" where N = sum of field_scores[].score rounded, M = total_fields_evaluated. Below: collapsible accordions:

  • Missing required fields (count badge): list of missing_required_fields with friendly labels (see §6.3 for label mapping).
  • Partially populated (count badge): partial_fields rendered as "{field}: {populated_subitems}/{total_subitems}".

In policy mode, don't show the dotted-path field names β€” show friendly labels only. In research mode, show both.
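The progress-bar numbers for the completeness panel can be derived as follows. This is a sketch under the definitions given for N and M; completenessBar and CompletenessLike are hypothetical names.

```typescript
// Local copy of the fields the panel header reads.
interface CompletenessLike {
  completeness_score: number
  total_fields_evaluated: number
  field_scores: Array<{ score: number }>
}

// Hypothetical helper: percent for the bar, plus the "{N} of {M}" label.
// N is the rounded sum of per-field scores; M is total_fields_evaluated.
function completenessBar(c: CompletenessLike): { percent: number; label: string } {
  let sum = 0
  for (const f of c.field_scores) sum += f.score
  const populated = Math.round(sum)
  return {
    percent: Math.round(c.completeness_score * 100),
    label: `${populated} of ${c.total_fields_evaluated} fields populated`,
  }
}
```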

Comparability panel

Also at eval-detail header level. Sourced from detail.evalcards?.annotations?.benchmark_comparability. Render as two collapsibles: "Variant divergence ({count})" and "Cross-party divergence ({count})". Each item should link to the relevant model row (use model_route_id from each group entry as anchor; add id={"row-" + model_route_id} on the leaderboard row).

When both arrays are empty, hide the panel entirely. When comparability_summary.groups_with_cross_party_check === 0 (the common state), surface a small note: "No third-party reports available for cross-party comparison."

4.5 Per-eval header chips

On the eval-detail page header (next to existing "Measures" / "Source dataset" chips around components/eval-detail.tsx:486-525), add a fourth chip when evalcards.annotations.reporting_completeness is present:

Documentation {round(completeness_score * 100)}%

Tooltip: "{N} of {M} EvalCards documentation fields populated for this benchmark."

4.6 Per-model card chips

On components/eval-card.tsx and the model card pages, add three chips driven by the model-level summaries. Replace the hand-written hint at components/eval-card.tsx:250 ("Some results lack generation settings; compare scores with care.") with a data-driven version:

{has_reproducibility_gap_count} of {results_total} reported scores aren't fully documented.

Show only when has_reproducibility_gap_count > 0. The hand-written hint was a placeholder for exactly this signal, so wire it up.
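As a small sketch, the data-driven hint reduces to a function that returns null when there is nothing to show. The name gapHint and the local interface are illustrative only.

```typescript
interface ReproSummaryLike {
  results_total: number
  has_reproducibility_gap_count: number
}

// Hypothetical helper: returns the hint text, or null when the hint should be hidden
// (summary absent on older snapshots, or no gaps).
function gapHint(summary: ReproSummaryLike | null | undefined): string | null {
  if (!summary || summary.has_reproducibility_gap_count <= 0) return null
  return `${summary.has_reproducibility_gap_count} of ${summary.results_total} reported scores aren't fully documented.`
}
```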


5. New page: corpus dashboard

Add app/corpus/page.tsx (linked from main navigation components/navigation.tsx). Server component that calls fetchCorpusAggregates() and renders four sections:

5.1 Reproducibility section

  • Headline number: reproducibility_gap_rate rendered as percentage. Sub-label: "{triples_with_reproducibility_gap} of {total_triples} reported scores."
  • Per-field horizontal bar chart from per_field_missingness. Bar denominator depends on denominator field: agentic-only fields use agentic_triples, others use total_triples. Label each bar with the denominator type so users understand.
  • Toggle: overall ↔ by_category (rendered as a small-multiple grid, one panel per category).
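The denominator rule for the per-field bars can be sketched like this. missingnessBars and the Bar shape are hypothetical, and the label wording is only a placeholder; the sketch recomputes the rate from the block-level counts as described above rather than relying on missing_rate.

```typescript
interface FieldMissingness {
  missing_count: number
  denominator: "all_triples" | "agentic_only"
}

interface ReproBlockLike {
  total_triples: number
  agentic_triples: number
  per_field_missingness: Record<string, FieldMissingness>
}

interface Bar {
  field: string
  rate: number | null // null when the denominator is zero
  denominatorLabel: string
}

// Hypothetical helper: one bar per field, using the denominator the field declares.
function missingnessBars(block: ReproBlockLike): Bar[] {
  return Object.keys(block.per_field_missingness).map((field) => {
    const m = block.per_field_missingness[field]
    const agenticOnly = m.denominator === "agentic_only"
    const denom = agenticOnly ? block.agentic_triples : block.total_triples
    return {
      field,
      rate: denom > 0 ? m.missing_count / denom : null,
      denominatorLabel: agenticOnly ? `of ${denom} agentic scores` : `of ${denom} scores`,
    }
  })
}
```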

5.2 Completeness section

  • Headline: completeness_score_mean (and median) across total_benchmarks.
  • Histogram of per-benchmark scores (pull individual benchmark scores from eval-list.json reporting_completeness.completeness_score, since corpus-aggregates only carries mean/median).
  • Per-field bar chart from per_field_population, with three bars per field: mean_score, populated_rate, fully_populated_rate. (See §6.7 for which one to highlight per coverage type.)

5.3 Provenance section

  • Stacked bar of source_type_distribution (across all triples).
  • Two ratios: multi_source_rate, first_party_only_rate. Label both: "% of (model, benchmark, metric) groups."

5.4 Comparability section

  • Two side-by-side panels: Variant divergence (eligible-aware rate) and Cross-party divergence (often null).
  • When cross_party_divergence_rate === null: show a "Not enough multi-org coverage to compute" empty state, not "0%". Same for variant_divergence_rate === null. This is critical; see §6.5.

All sections support a category toggle (research mode shows category breakdowns by default; policy mode shows overall by default).


6. Caveats and edge cases (read these before implementing)

6.1 first_party_only semantics

A row can be first_party_only: true even when is_multi_source: false. Read literally, the spec gives the badge to any group in which a single named org reports first-party. Don't read it as "exclusive coverage"; read it as "no independent replication." The suggested label is "First-party only" rather than "Sole source."

If distinct_reporting_organizations === 0 (all rows have null org), first_party_only is false even when source_type === "first_party". Render the row's source as "First-party (org unspecified)" in research mode; suppress the first-party-only badge.
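These two cases can be captured in a pair of small helpers. This is a sketch: the function names are hypothetical, the defensive check on distinct_reporting_organizations mirrors the backend behavior described above rather than adding a new rule, and the plain "1st party" label is taken from the wording cheatsheet.

```typescript
interface ProvenanceLike {
  source_type: "first_party" | "third_party" | "collaborative" | "unspecified"
  first_party_only: boolean
  distinct_reporting_organizations: number
}

// True when the "First-party only" indicator should be drawn.
// With zero named orgs the backend already sets first_party_only to false;
// the second condition is only a guard.
function showFirstPartyOnly(p: ProvenanceLike): boolean {
  return p.first_party_only === true && p.distinct_reporting_organizations > 0
}

// Research-mode source label for first-party rows; returns null for other source types.
function firstPartyResearchLabel(p: ProvenanceLike): string | null {
  if (p.source_type !== "first_party") return null
  if (p.distinct_reporting_organizations === 0) return "First-party (org unspecified)"
  return "1st party"
}
```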

6.2 Active reproducibility field set is reduced

The spec describes four base fields (temperature, top_p, max_tokens, prompt_template); the active backend currently checks only temperature and max_tokens plus eval_plan / eval_limits for agentic benchmarks. Don't hardcode "4 fields" anywhere. Always read required_field_count off the annotation. This is a deliberate spec-author choice and may revert; the field count is the only stable interface.

6.3 Missing-field path strings

missing_fields for reproducibility uses bare names (e.g. "max_tokens"). missing_required_fields for completeness uses dotted paths (e.g. "autobenchmarkcard.methodology.baseline_results"). The conventions differ intentionally. Build a small label map for completeness paths; the paths come from registry/completeness_fields.json on the backend repo. Suggested label rules:

  • Drop the autobenchmarkcard. / eee_eval. / evalcards. prefix.
  • Replace dots with " / " and underscores with spaces, then capitalize the first letter of each segment.
  • Example: autobenchmarkcard.methodology.baseline_results → "Methodology / Baseline results".
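The label rules can be sketched as a single function. friendlyFieldLabel is a hypothetical name; following the example, only the first letter of each path segment is capitalized.

```typescript
// Prefixes to drop, per the rules above.
const LABEL_PREFIXES = ["autobenchmarkcard.", "eee_eval.", "evalcards."]

// Hypothetical helper: turns a dotted completeness path into a display label.
function friendlyFieldLabel(path: string): string {
  let rest = path
  for (const prefix of LABEL_PREFIXES) {
    if (rest.indexOf(prefix) === 0) {
      rest = rest.slice(prefix.length)
      break // strip at most one prefix
    }
  }
  return rest
    .split(".")
    .map((segment) => {
      const words = segment.replace(/_/g, " ")
      return words.charAt(0).toUpperCase() + words.slice(1)
    })
    .join(" / ")
}
```

A hand-maintained override map could sit in front of this function for paths whose derived label reads badly (for example, acronyms).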

6.4 differing_setup_fields[].values may contain null and mixed types

Per spec §6.1.4, null is a distinct value from any explicit setting (comparing "explicit 2048" to "unspecified" is meaningful). Render null as "(unspecified)" rather than the string "null". Numeric, string, boolean, and object values can all appear in the same array; render with JSON.stringify for objects, plain text otherwise.
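A minimal sketch of that rendering rule; renderSetupValue is a hypothetical helper name. Treating undefined like null is an extra defensive assumption, since the backend is described as writing explicit null.

```typescript
// Hypothetical helper: renders one entry of differing_setup_fields[].values.
function renderSetupValue(value: unknown): string {
  if (value === null || value === undefined) return "(unspecified)"
  if (typeof value === "object") return JSON.stringify(value) // objects and arrays
  return String(value) // numbers, strings, booleans
}
```

For example, the values array [2048, null, { "mode": "auto" }] would render as 2048, (unspecified), and {"mode":"auto"}.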

6.5 null rates in comparability are not zero

Eligibility-aware denominators mean variant_divergence_rate and cross_party_divergence_rate are null when no groups were eligible. Render as "N/A (not enough data)" or an empty-state card, never as "0%". On the current corpus, cross_party_divergence_rate will commonly be null (third-party reports are sparse). Treat this as a normal state, not a data-loading error.
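A small formatter makes the null-versus-zero distinction hard to get wrong. formatRate is a hypothetical name, and the exact empty-state wording is a placeholder.

```typescript
// Hypothetical helper: formats a corpus rate for display.
// null (no eligible groups) and 0 (eligible groups, none divergent) are different states.
function formatRate(rate: number | null | undefined): string {
  if (rate === null || rate === undefined) return "N/A (not enough data)"
  return `${Math.round(rate * 100)}%`
}
```

Avoid patterns such as `rate ?? 0` or `rate || 0` upstream of the formatter, since they collapse the two states.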

6.6 Score-scale anomaly flag

variant_divergence.score_scale_anomaly === true indicates the metric was declared proportion but scores fell outside [0, 1], usually a metric-normalization bug upstream. Surface as a small "data quality warning" annotation alongside the divergence number; the divergence is still computed, but the threshold may not be apples-to-apples.

6.7 mean_score vs populated_rate for completeness

Per-field aggregates expose three numbers. Pick which to display based on coverage_type:

  • full and reserved fields: mean_score and populated_rate are equal. Show one number labeled "% of benchmarks populating this field."
  • partial fields: they diverge. populated_rate = % of benchmarks with any sub-item; mean_score = average sub-item population fraction. Show both: "{populated_rate}% have any data, {mean_score}% on average across sub-items."
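The choice between the two displays might look like the sketch below. It assumes the caller already knows the field's coverage_type (for example, from a benchmark's field_scores entries, since per_field_population itself does not carry it); describeFieldPopulation is a hypothetical name.

```typescript
type CoverageType = "full" | "partial" | "reserved"

interface FieldPopulationLike {
  mean_score: number
  populated_rate: number
}

// Hypothetical helper: picks the caption for one field in the per-field chart.
function describeFieldPopulation(coverage: CoverageType, stats: FieldPopulationLike): string {
  const pct = (x: number): number => Math.round(x * 100)
  if (coverage === "partial") {
    return `${pct(stats.populated_rate)}% have any data, ${pct(stats.mean_score)}% on average across sub-items.`
  }
  // full and reserved: mean_score equals populated_rate, so one number is enough.
  return `${pct(stats.populated_rate)}% of benchmarks populating this field.`
}
```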

6.8 No computed_at on per-record annotations

Only signal_version is on each annotation. For "last computed" UI text, use manifest.json → generated_at from the existing BackendManifest.

6.9 Stratification categories

by_category keys are: agentic, general, knowledge, reasoning, safety, other. This is the same set as the existing category field on evals, so reuse whatever color scheme is currently keyed off inferCategoryFromBenchmark (lib/benchmark-schema.ts).

6.10 Annotation block can be null or absent

evalcards.annotations.{reproducibility_gap,provenance,variant_divergence,cross_party_divergence} can each be null independently, and the entire evalcards block may be absent on older cached snapshots. Use optional chaining everywhere; never assume presence. The RowAnnotations type intentionally types each subfield as T | null (not T | undefined) because the backend writes explicit null.


7. Suggested implementation order

  1. Types + plumbing (1–2 hours): types in backend-artifacts.ts + hf-data.ts, the fetchCorpusAggregates fetcher, the API route, and adding corpus-aggregates.json to the cache script. No UI yet.
  2. Row-level badges (½ day): build signals/ directory with the four badge components, the dedup-aware signals-row-badges.tsx, and wire into eval-detail and benchmark-detail. This is the most visible win.
  3. Per-eval completeness panel + comparability panel (½ day): single benchmark, easy to design around. New CompletenessPanel is the headline new UX in this set.
  4. Per-row reproducibility detail panel (1–2 hours): drops into the existing expanded row layout.
  5. Per-eval / per-model header chips + replace the hand-written gap hint (1–2 hours): wires the summary fields into existing card surfaces.
  6. Corpus dashboard page (1–2 days): new route, new components, biggest scope. Defer until 1–5 are live and reviewed.

Each step is independently shippable. Steps 1–5 can land before the corpus dashboard is designed.


8. Out of scope (don't do these yet)

  • Filter / sort the eval list by signal state ("show only benchmarks with completeness > 0.5"). Wait for the dashboard view to land first; users will tell us which filters they actually want.
  • Side-by-side score comparison with divergence overlay. The data supports it (scores_in_group, scores_by_organization) but the design space is large. Hold off until we see the row-level badges in use.
  • Recompute / verification UI for missing reproducibility fields. Backend-side; out of scope here.
  • Per-instance sample-level badges. Signals operate at row / benchmark level; sample-level instance data is unaffected.

9. Reference: minimal real-shape examples

Per-row evalcards.annotations with all four signals populated:

{
  "reproducibility_gap": {
    "has_reproducibility_gap": true,
    "missing_fields": ["max_tokens"],
    "required_field_count": 2,
    "populated_field_count": 1,
    "signal_version": "1.0"
  },
  "provenance": {
    "source_type": "first_party",
    "is_multi_source": false,
    "first_party_only": true,
    "distinct_reporting_organizations": 1,
    "signal_version": "1.0"
  },
  "variant_divergence": null,
  "cross_party_divergence": null
}

Per-eval evalcards.annotations with completeness + comparability:

{
  "reporting_completeness": {
    "completeness_score": 0.62,
    "total_fields_evaluated": 28,
    "missing_required_fields": [
      "autobenchmarkcard.methodology.baseline_results",
      "autobenchmarkcard.methodology.validation",
      "evalcards.preregistration_url"
    ],
    "partial_fields": [
      { "field_path": "autobenchmarkcard.data", "score": 0.5, "populated_subitems": 2, "total_subitems": 4 }
    ],
    "field_scores": [/* 28 entries */],
    "signal_version": "1.0"
  },
  "benchmark_comparability": {
    "variant_divergence_groups": [
      {
        "group_id": "openai__gpt-5__hfopenllm_v2_bbh_accuracy",
        "model_route_id": "openai__gpt-5",
        "divergence_magnitude": 0.12,
        "threshold_used": 0.05,
        "threshold_basis": "proportion_or_continuous_normalized",
        "differing_setup_fields": [
          { "field": "max_tokens", "values": [2048, 4096, 8192] }
        ]
      }
    ],
    "cross_party_divergence_groups": []
  }
}

Top-level provenance_summary example:

{
  "total_results": 142,
  "total_groups": 47,
  "multi_source_groups": 3,
  "first_party_only_groups": 30,
  "source_type_distribution": {
    "first_party": 120,
    "third_party": 18,
    "collaborative": 0,
    "unspecified": 4
  }
}

corpus-aggregates.json structure (top of file):

{
  "generated_at": "2026-04-27T...",
  "signal_version": "1.0",
  "stratification_dimensions": ["category"],
  "reproducibility": { "overall": {/* ReproducibilityCorpusBlock */}, "by_category": { "agentic": {...}, "general": {...}, ... } },
  "completeness":   { "overall": {/* CompletenessCorpusBlock */},   "by_category": {...} },
  "provenance":     { "overall": {/* ProvenanceCorpusBlock */},     "by_category": {...} },
  "comparability":  { "overall": {/* ComparabilityCorpusBlock */},  "by_category": {...} }
}

10. Audience-mode wording cheatsheet

| Element | Research mode | Policy mode |
| --- | --- | --- |
| Reproducibility gap badge | "Reproducibility gap" | "Setup not documented" |
| Reproducibility tooltip | "Setup not fully documented. Missing: {fields}." | "This score's setup isn't documented, so it can't be re-run as-is." |
| Reproducibility panel title | "Reproducibility" | "Re-runnability" |
| Completeness chip label | "Documentation" | "Documentation" |
| Completeness panel title | "Reporting completeness" | "How well is this benchmark documented?" |
| Provenance: first-party | "1st party" | "Reported by model developer" |
| Provenance: first-party only | "1st party only: no replication" | "Only the model developer reported this score" |
| Provenance: third-party | "3rd party" | "Independently reported" |
| Provenance: collaborative | "Collaborative" | "Joint report" |
| Variant divergence badge | "Variant divergence" | "Score depends on setup" |
| Variant divergence tooltip | "Scores diverge by {magnitude} across different setups: {fields}." | "Different runs of this evaluation produced different scores; the setup matters." |
| Cross-party divergence badge | "Cross-party divergence" | "Sources disagree" |
| Cross-party divergence tooltip | "Reports diverge by {magnitude} across organizations." | "Different organizations reported different scores for this same model on this same benchmark." |

Adjust tone but keep the underlying numbers identical across modes: the data is the same, only the framing changes.


Last updated 2026-04-27. Maintainer: backend pipeline (eval_cards_backend_pipeline), frontend (general-eval-card). Questions on backend semantics → eval_cards_backend_pipeline#2. Questions on UX → discuss with @anka-evals + frontend team.