Spaces:

evaleval
/

general-eval-card

Running

App Files Files Community

general-eval-card / docs /INTERPRETIVE_SIGNALS.md

evijit HF Staff

Add interpretive signals, corpus dashboard, and slice browser

bca888a 26 days ago

preview code

raw

history blame

32.6 kB

	# EvalCards interpretive signals — frontend implementation spec

	Status: ready to implement. Backend ships in `evaleval/eval_cards_backend_pipeline` PR #1 (merged `b05323c`). All field shapes below are stable and covered by the backend's test suite.

	Companion docs:
	- Spec source of truth: EvalCards Interpretive Signals v1.0 (Anka Reuel, Stanford). Section refs (§3, §4, …) below point at that doc.
	- Open backend questions: [evaleval/eval_cards_backend_pipeline#2](https://github.com/evaleval/eval_cards_backend_pipeline/issues/2). None block frontend work — they may shift wording, not shape.

	---

	## 0. What this PR does at a glance

	The backend now annotates evaluation records with four interpretive signals:

	1. Reproducibility gap — per row. Was the evaluation documented well enough to be re-run? Surfaced as a missing-fields list (e.g. "missing `max_tokens`").
	2. Reporting completeness — per benchmark. What fraction of EvalCards-required documentation fields are populated? Surfaced as a `[0, 1]` score with a missing-field breakdown.
	3. Provenance — per row. Who reported this score (first-party / third-party / collaborative / unspecified), and is it the only source for this `(model, benchmark, metric)` group?
	4. Comparability — per `(model, benchmark, metric)` group. Two flavors: variant divergence (same model, same benchmark, different setups → diverging scores) and cross-party divergence (different orgs reporting → diverging scores).

	Plus a corpus-level rollup file (`corpus-aggregates.json`) for a stratified analytics page.

	The frontend's job: surface these signals in three places — row-level badges, per-eval / per-model summary panels, and a corpus dashboard view.

	---

	## 1. Where the new data lives

	All fields are new additions to existing artifacts. No artifact is removed or reshaped.

	\| Artifact \| New fields \|
	\|---\|---\|
	\| `evals/{id}.json` (`HFEvalDetail`) \| Per-row `evalcards.annotations` block on every `metrics[].model_results[]` and `subtasks[…].metrics[].model_results[]`. Plus eval-root `evalcards.annotations.reporting_completeness`, `evalcards.annotations.benchmark_comparability`, and three top-level summaries: `reproducibility_summary`, `provenance_summary`, `comparability_summary`. \|
	\| `models/{id}.json` (`HFModelDetail`) \| Per-row `evalcards.annotations` block on every `hierarchy_by_category[][].metrics[].model_results[]`. Plus three top-level summaries scoped to that model. \|
	\| `eval-list.json` / `eval-list-lite.json` (`HFEvalListEntry`) \| Three summaries per entry. \|
	\| `model-cards.json` / `model-cards-lite.json` (`HFModelCardEntry`) \| Three summaries per entry. \|
	\| `eval-hierarchy.json` (`EvalHierarchy`) \| Each family node and leaf node carries the three summaries (aggregated over evals under it). \|
	\| `corpus-aggregates.json` (NEW FILE) \| Stratified rollups for paper / dashboard use. \|
	\| `manifest.json` \| New entry in `summary_artifacts`: `corpus_aggregates: "corpus-aggregates.json"`. \|

	`signal_version` (currently `"1.0"`) is present on every annotation. Treat it as opaque; surface only in admin/debug.

	---

	## 2. TypeScript types to add

	Add to `lib/backend-artifacts.ts` (preferred — these are pipeline contract types):

	```ts
	// Spec §3
	export interface ReproducibilityGap {
	has_reproducibility_gap: boolean
	missing_fields: string[] // e.g. ["max_tokens"]
	required_field_count: number // 2 base + 2 if agentic on current runtime
	populated_field_count: number
	signal_version: string
	}

	// Spec §5
	export type ProvenanceSourceType =
	\| "first_party"
	\| "third_party"
	\| "collaborative"
	\| "unspecified"

	export interface Provenance {
	source_type: ProvenanceSourceType
	is_multi_source: boolean
	first_party_only: boolean // see §6.1 below for caveat
	distinct_reporting_organizations: number
	signal_version: string
	}

	// Spec §6.1
	export interface VariantDivergence {
	has_variant_divergence: boolean
	group_id: string // "{model_route_id}__{metric_summary_id}"
	divergence_magnitude: number
	threshold_used: number
	threshold_basis:
	\| "proportion_or_continuous_normalized"
	\| "percent"
	\| "range_5pct"
	\| "fallback_default"
	differing_setup_fields: Array<{ field: string; values: unknown[] }>
	scores_in_group: number[]
	this_triple_score: number \| null // this row's score within the group
	triple_count_in_group: number
	score_scale_anomaly: boolean
	group_variant_breakdown: Array<{ variant_key: string; row_count: number }>
	signal_version: string
	}

	// Spec §6.2
	export interface CrossPartyDivergence {
	has_cross_party_divergence: boolean
	group_id: string
	divergence_magnitude: number
	threshold_used: number
	threshold_basis: VariantDivergence["threshold_basis"]
	scores_by_organization: Record<string, number> // display org name → score
	differing_setup_fields: Array<{ field: string; values: unknown[] }>
	organization_count: number
	group_variant_breakdown: Array<{ variant_key: string; row_count: number }>
	signal_version: string
	}

	// Per-row annotation block (carried on every model_result row)
	export interface RowAnnotations {
	reproducibility_gap: ReproducibilityGap \| null
	provenance: Provenance \| null
	variant_divergence: VariantDivergence \| null
	cross_party_divergence: CrossPartyDivergence \| null
	}

	// Spec §4
	export interface ReportingCompleteness {
	completeness_score: number // [0, 1]
	total_fields_evaluated: number
	missing_required_fields: string[] // dotted paths
	partial_fields: Array<{
	field_path: string
	score: number // (0, 1) — strictly between
	populated_subitems: number
	total_subitems: number
	}>
	field_scores: Array<{
	field_path: string
	coverage_type: "full" \| "partial" \| "reserved"
	score: number // [0, 1]
	}>
	signal_version: string
	}

	export interface BenchmarkComparability {
	variant_divergence_groups: Array<{
	group_id: string
	model_route_id: string
	divergence_magnitude: number
	threshold_used: number
	threshold_basis: VariantDivergence["threshold_basis"]
	differing_setup_fields: VariantDivergence["differing_setup_fields"]
	}>
	cross_party_divergence_groups: Array<{
	group_id: string
	model_route_id: string
	divergence_magnitude: number
	threshold_used: number
	threshold_basis: VariantDivergence["threshold_basis"]
	scores_by_organization: Record<string, number>
	differing_setup_fields: VariantDivergence["differing_setup_fields"]
	}>
	}

	// Eval-root or model-root annotation block
	export interface EvalcardsAnnotations {
	reporting_completeness?: ReportingCompleteness
	benchmark_comparability?: BenchmarkComparability
	}

	// Top-level summary blocks (present on eval-list / model-cards / eval / model / hierarchy nodes)
	export interface ReproducibilitySummary {
	results_total: number
	has_reproducibility_gap_count: number
	populated_ratio_avg: number \| null // null when results_total == 0
	}

	export interface ProvenanceSummary {
	total_results: number
	total_groups: number
	multi_source_groups: number
	first_party_only_groups: number
	source_type_distribution: Record<ProvenanceSourceType, number>
	}

	export interface ComparabilitySummary {
	total_groups: number
	groups_with_variant_check: number // eligible groups (>=2 rows, differing setups, >=2 scored)
	groups_with_cross_party_check: number // eligible groups (>=2 named orgs)
	variant_divergent_count: number
	cross_party_divergent_count: number
	}

	export interface SignalSummaries {
	reproducibility_summary?: ReproducibilitySummary
	provenance_summary?: ProvenanceSummary
	comparability_summary?: ComparabilitySummary
	}

	// corpus-aggregates.json
	export interface CorpusAggregates {
	generated_at: string
	signal_version: string
	stratification_dimensions: ["category"]
	reproducibility: Stratified<ReproducibilityCorpusBlock>
	completeness: Stratified<CompletenessCorpusBlock>
	provenance: Stratified<ProvenanceCorpusBlock>
	comparability: Stratified<ComparabilityCorpusBlock>
	}

	export interface Stratified<T> {
	overall: T
	by_category: Record<string, T> // categories: agentic \| general \| knowledge \| reasoning \| safety \| other
	}

	export interface ReproducibilityCorpusBlock {
	total_triples: number
	triples_with_reproducibility_gap: number
	reproducibility_gap_rate: number \| null
	agentic_triples: number
	per_field_missingness: Record<string, {
	missing_count: number
	missing_rate: number \| null
	denominator: "all_triples" \| "agentic_only"
	denominator_count: number
	}>
	}

	export interface CompletenessCorpusBlock {
	total_benchmarks: number
	completeness_score_mean: number \| null
	completeness_score_median: number \| null
	per_field_population: Record<string, {
	mean_score: number
	populated_rate: number
	fully_populated_rate: number
	benchmark_count: number
	}>
	}

	export interface ProvenanceCorpusBlock {
	total_triples: number
	total_groups: number
	multi_source_groups: number
	multi_source_rate: number \| null
	first_party_only_groups: number
	first_party_only_rate: number \| null
	source_type_distribution: Record<ProvenanceSourceType, number>
	}

	export interface ComparabilityCorpusBlock {
	total_groups: number
	variant_eligible_groups: number
	variant_divergent_groups: number
	variant_divergence_rate: number \| null
	cross_party_eligible_groups: number
	cross_party_divergent_groups: number
	cross_party_divergence_rate: number \| null // commonly null on current corpus
	}
	```

	Then in `lib/hf-data.ts`:

	- Extend `HFEvalModelResult` (line ~522) with `evalcards?: { annotations?: RowAnnotations }`.
	- Extend `HFEvalDetail` (line ~556) with `evalcards?: { annotations?: EvalcardsAnnotations }` plus the three summary fields from `SignalSummaries`.
	- Extend `HFEvalListEntry` (line ~475) with `SignalSummaries` fields.
	- Extend `HFModelCardEntry` (line ~439) with `SignalSummaries` fields.
	- Extend `HFModelDetail` (line ~571) with `SignalSummaries` fields.
	- Extend `HFModelHierarchyMetric` (line ~616) — `model_results` already typed as `HFEvalModelResult`, so the per-row annotations propagate automatically.

	In `EvalHierarchy` types (`lib/backend-artifacts.ts` line ~54), add `SignalSummaries` to both `HierarchyFamily` and `HierarchyBenchmark`.

	All fields are optional at the type level — older cached snapshots won't have them, and the frontend should render gracefully when they're absent.

	---

	## 3. Data plumbing

	### 3.1 New fetcher + API route for corpus aggregates

	In `lib/hf-data.ts`, add after the existing fetchers (~line 866):

	```ts
	export async function fetchCorpusAggregates(): Promise<CorpusAggregates \| null> {
	return fetchHFJsonSafe<CorpusAggregates>("corpus-aggregates.json")
	}
	```

	Add to `scripts/cache-hf-data.mjs` `CACHE_ROOT_FILES` array: `"corpus-aggregates.json"`. (Mark it optional in `OPTIONAL_CACHE_ROOT_FILES` if shipping while the HF dataset upload is still rolling — once the backend pipeline next runs against the dataset, the file will appear.)

	Create `app/api/corpus-aggregates/route.ts`:

	```ts
	import { NextResponse } from "next/server"
	import { fetchCorpusAggregates } from "@/lib/hf-data"

	export async function GET() {
	const aggregates = await fetchCorpusAggregates()
	if (!aggregates) {
	return NextResponse.json({ error: "Corpus aggregates not available" }, { status: 404 })
	}
	return NextResponse.json(aggregates)
	}
	```

	### 3.2 Rest of plumbing is automatic

	Existing fetchers (`fetchEvalDetail`, `fetchModelDetail`, `fetchEvalList`, `fetchModelCardsList`, `fetchEvalHierarchy`) just pull the raw JSON, so the new fields propagate without code changes once the types above are widened.

	---

	## 4. UX components to build

	Build a small set of reusable signal components in `components/signals/`. Each takes one of the typed shapes above and renders a badge / panel. This keeps signal rendering consistent across `eval-detail.tsx`, `benchmark-detail.tsx`, `model-compare-dialog.tsx`, and the new corpus dashboard.

	```
	components/signals/
	├── reproducibility-badge.tsx
	├── provenance-badge.tsx // already partially exists in benchmark-detail.tsx — see §4.2
	├── variant-divergence-badge.tsx
	├── cross-party-divergence-badge.tsx
	├── reproducibility-panel.tsx // detail view — full missing-fields list
	├── completeness-panel.tsx // detail view — score bar + missing-field list
	├── comparability-panel.tsx // detail view — divergent groups list
	├── signals-row-badges.tsx // composite: renders all four row-level badges with proper spacing
	└── signal-tooltip.tsx // shared tooltip primitive
	```

	All badges should follow the existing tone conventions used by `getRelationshipBadgeTone` ([components/benchmark-detail.tsx:289](../components/benchmark-detail.tsx#L289)) and the `Badge` primitive in [components/ui/badge.tsx](../components/ui/badge.tsx).

	### 4.1 Row-level badges — placement

	Insert `<SignalsRowBadges annotations={modelResult.evalcards?.annotations} />` next to the score cell in:

	- Eval detail leaderboard table — [components/eval-detail.tsx:869-871](../components/eval-detail.tsx#L869-L871) (the `<TableCell className="text-right">` containing the score). Render badges below the score on a new line for desktop, hidden on mobile.
	- Benchmark detail rows — `components/benchmark-detail.tsx` renders score rows in several places (search for `formatRawScoreValue`); insert the same component.
	- Model compare dialog — [components/model-compare-dialog.tsx](../components/model-compare-dialog.tsx) score columns.

	Display rules — only badge for actionable states. Silence is meaningful here.

	\| Signal \| Show badge when \| Hide when \|
	\|---\|---\|---\|
	\| Reproducibility \| `has_reproducibility_gap === true` \| gap=false, or annotation absent \|
	\| Provenance \| `source_type` ∈ {`first_party`, `third_party`, `collaborative`} \| `source_type === "unspecified"` \|
	\| Variant divergence \| `variant_divergence !== null && has_variant_divergence === true` \| null (not applicable) or false (checked, fine) \|
	\| Cross-party divergence \| `cross_party_divergence !== null && has_cross_party_divergence === true` \| null (almost always on current corpus) or false \|

	`has_: false` means "we checked and it's fine" — silent success. `null` means "not applicable / not enough data" — also silent. Only divergent / gap-positive states warrant pixels.*

	Dedup rule. `variant_divergence` and `cross_party_divergence` are duplicated onto every row in the same group. If you render three rows from the same `group_id`, render the divergence badge on each row but the expanded panel (§4.4) only once at the group header.

	### 4.2 Provenance badge — reuse what's there

	[components/benchmark-detail.tsx:262-302](../components/benchmark-detail.tsx#L262-L302) already has `getRelationshipShortLabel` and `getRelationshipBadgeTone`. Extract these into `components/signals/provenance-badge.tsx` and import back into `benchmark-detail.tsx`. The new badge should also consume the new `Provenance` annotation when present (it carries `is_multi_source` and `first_party_only`, which the current implementation derives row-by-row from `source_metadata` alone).

	When `provenance.first_party_only === true`, show a small ⚠ subtle indicator on the first-party badge ("first-party only — no independent replication"). This is the headline use of the signal for policy-mode readers.

	### 4.3 Reproducibility badge — content rules

	Tooltip content depends on audience mode (`useAudienceMode()` from [components/audience-mode-provider.tsx:40](../components/audience-mode-provider.tsx#L40)):

	- Research mode: "Setup not fully documented. Missing: `max_tokens`, `eval_plan`."
	- Policy mode: "This score's setup isn't fully documented, so it can't be re-run as-is."

	Always include the count "{populated_field_count} of {required_field_count} setup fields recorded." Don't hardcode "4 fields" — the active runtime checks 2 base fields (`temperature`, `max_tokens`) plus 2 agentic fields (`eval_plan`, `eval_limits`) when the benchmark is agentic. Read counts off the annotation.

	### 4.4 Detail panels — placement

	#### Reproducibility panel
	The existing "Evaluation Provenance" panel in [components/eval-detail.tsx:952-998](../components/eval-detail.tsx#L952-L998) (rendered when a row is expanded) is the right place for the per-row reproducibility breakdown. Add a new `DetailPanel` adjacent to it:

	```tsx
	{rowAnnotations?.reproducibility_gap && (
	<DetailPanel
	title={isResearchView ? "Reproducibility" : "Re-runnability"}
	subtitle={
	isResearchView
	? "Whether the setup is documented well enough for someone else to re-run."
	: "Whether someone could re-run this evaluation with the information available."
	}
	>
	<MetaRow
	label="Setup fields recorded"
	value={`${rowAnnotations.reproducibility_gap.populated_field_count} of ${rowAnnotations.reproducibility_gap.required_field_count}`}
	/>
	{rowAnnotations.reproducibility_gap.missing_fields.length > 0 && (
	<MetaRow
	label="Missing"
	value={rowAnnotations.reproducibility_gap.missing_fields.join(", ")}
	/>
	)}
	</DetailPanel>
	)}
	```

	#### Completeness panel
	Render at the eval-detail header level (above the leaderboard, below the metric specification card). New `<CompletenessPanel completeness={detail.evalcards?.annotations?.reporting_completeness} />`. UI: progress bar showing `completeness_score`, label "{N} of {M} fields populated" where N = sum of `field_scores[].score` rounded, M = `total_fields_evaluated`. Below: collapsible accordions:

	- Missing required fields (count badge) — list of `missing_required_fields` with friendly labels (see §6.4 for label mapping).
	- Partially populated (count badge) — `partial_fields` rendered as "{field}: {populated_subitems}/{total_subitems}".

	In policy mode, don't show the dotted-path field names — show friendly labels only. In research mode, show both.

	#### Comparability panel
	Also at eval-detail header level. Sourced from `detail.evalcards?.annotations?.benchmark_comparability`. Render as two collapsibles — "Variant divergence ({count})" and "Cross-party divergence ({count})". Each item should link to the relevant model row (use `model_route_id` from each group entry as anchor — add `id={"row-" + model_route_id}` on the leaderboard row).

	When both arrays are empty, hide the panel entirely. When `comparability_summary.groups_with_cross_party_check === 0` (the common state), surface a small note: "No third-party reports available for cross-party comparison."

	### 4.5 Per-eval header chips
	On the eval-detail page header (next to existing "Measures" / "Source dataset" chips around [components/eval-detail.tsx:486-525](../components/eval-detail.tsx#L486-L525)), add a fourth chip when `evalcards.annotations.reporting_completeness` is present:

	> Documentation
	> {round(completeness_score * 100)}%

	Tooltip: "{N} of {M} EvalCards documentation fields populated for this benchmark."

	### 4.6 Per-model card chips
	On `components/eval-card.tsx` and the model card pages, add three chips driven by the model-level summaries. Replace the hand-written hint at [components/eval-card.tsx:250](../components/eval-card.tsx#L250) ("Some results lack generation settings; compare scores with care.") with a data-driven version:

	> {has_reproducibility_gap_count} of {results_total} reported scores aren't fully documented.

	Show only when `has_reproducibility_gap_count > 0`. The hand-written hint was a placeholder for exactly this signal — wire it up.

	---

	## 5. New page: corpus dashboard

	Add `app/corpus/page.tsx` (linked from main navigation [components/navigation.tsx](../components/navigation.tsx)). Server component that calls `fetchCorpusAggregates()` and renders four sections:

	### 5.1 Reproducibility section
	- Headline number: `reproducibility_gap_rate` rendered as percentage. Sub-label: "{triples_with_reproducibility_gap} of {total_triples} reported scores."
	- Per-field horizontal bar chart from `per_field_missingness`. Bar denominator depends on `denominator` field: agentic-only fields use `agentic_triples`, others use `total_triples`. Label each bar with the denominator type so users understand.
	- Toggle: `overall` ↔ `by_category` (rendered as a small-multiple grid, one panel per category).

	### 5.2 Completeness section
	- Headline: `completeness_score_mean` (and median) across `total_benchmarks`.
	- Histogram of per-benchmark scores (pull individual benchmark scores from `eval-list.json` `reporting_completeness.completeness_score`, since corpus-aggregates only carries mean/median).
	- Per-field bar chart from `per_field_population` — three bars per field: `mean_score`, `populated_rate`, `fully_populated_rate`. (See §6.7 for which one to highlight per coverage type.)

	### 5.3 Provenance section
	- Stacked bar of `source_type_distribution` (across all triples).
	- Two ratios: `multi_source_rate`, `first_party_only_rate`. Label both: "% of (model, benchmark, metric) groups."

	### 5.4 Comparability section
	- Two side-by-side panels: Variant divergence (eligible-aware rate) and Cross-party divergence (often null).
	- When `cross_party_divergence_rate === null`: show a "Not enough multi-org coverage to compute" empty state, not "0%". Same for `variant_divergence_rate === null`. This is critical — see §6.5.

	All sections support a category toggle (research mode shows category breakdowns by default; policy mode shows overall by default).

	---

	## 6. Caveats and edge cases (read these before implementing)

	### 6.1 `first_party_only` semantics
	A row can be `first_party_only: true` even when `is_multi_source: false`. The spec literal: a group with one named org reporting first-party gets the badge. Don't read it as "exclusive coverage" — read it as "no independent replication." The label suggestion is "First-party only" rather than "Sole source."

	If `distinct_reporting_organizations === 0` (all rows have null org), `first_party_only` is `false` even when `source_type === "first_party"`. Render the row's source as "First-party (org unspecified)" in research mode; suppress the first-party-only badge.

	### 6.2 Active reproducibility field set is reduced
	The spec describes four base fields (`temperature`, `top_p`, `max_tokens`, `prompt_template`); the active backend currently checks only `temperature` and `max_tokens` plus `eval_plan` / `eval_limits` for agentic benchmarks. Don't hardcode "4 fields" anywhere. Always read `required_field_count` off the annotation. This is a deliberate spec-author choice and may revert; the field count is the only stable interface.

	### 6.3 Missing-field path strings
	`missing_fields` for reproducibility uses bare names (e.g. `"max_tokens"`). `missing_required_fields` for completeness uses dotted paths (e.g. `"autobenchmarkcard.methodology.baseline_results"`). Different conventions, intentional. Build a small label map for completeness paths — paths come from [registry/completeness_fields.json](https://github.com/evaleval/eval_cards_backend_pipeline/blob/main/registry/completeness_fields.json) on the backend repo. Suggested label rules:

	- Drop the `autobenchmarkcard.` / `eee_eval.` / `evalcards.` prefix.
	- Replace dots with " / ", underscore with space, title-case.
	- Example: `autobenchmarkcard.methodology.baseline_results` → "Methodology / Baseline results".

	### 6.4 `differing_setup_fields[].values` may contain null and mixed types
	Per spec §6.1.4, `null` is a distinct value from any explicit setting (comparing "explicit 2048" to "unspecified" is meaningful). Render `null` as "(unspecified)" rather than the string "null". Numeric, string, boolean, and object values can all appear in the same array; render with `JSON.stringify` for objects, plain text otherwise.

	### 6.5 `null` rates in comparability are not zero
	Eligibility-aware denominators mean `variant_divergence_rate` and `cross_party_divergence_rate` are `null` when no groups were eligible. Render as "N/A — not enough data" or an empty-state card, never as "0%". On the current corpus, `cross_party_divergence_rate` will commonly be null (third-party reports are sparse). Treat this as a normal state, not a data-loading error.

	### 6.6 Score-scale anomaly flag
	`variant_divergence.score_scale_anomaly === true` indicates the metric was declared `proportion` but scores fell outside [0, 1] — usually a metric-normalization bug upstream. Surface as a small "data quality warning" annotation alongside the divergence number; the divergence is still computed but the threshold may not be apples-to-apples.

	### 6.7 `mean_score` vs `populated_rate` for completeness
	Per-field aggregates expose three numbers. Pick which to display based on `coverage_type`:

	- `full` and `reserved` fields — `mean_score` and `populated_rate` are equal. Show one number labeled "% of benchmarks populating this field."
	- `partial` fields — they diverge. `populated_rate` = % of benchmarks with any sub-item; `mean_score` = average sub-item population fraction. Show both: "{populated_rate}% have any data, {mean_score}% on average across sub-items."

	### 6.8 No `computed_at` on per-record annotations
	Only `signal_version` is on each annotation. For "last computed" UI text, use `manifest.json → generated_at` from the existing `BackendManifest`.

	### 6.9 Stratification categories
	`by_category` keys are: `agentic`, `general`, `knowledge`, `reasoning`, `safety`, `other`. Same set as the existing `category` field on evals — reuse whatever color scheme is currently keyed off `inferCategoryFromBenchmark` ([lib/benchmark-schema.ts](../lib/benchmark-schema.ts)).

	### 6.10 Annotation block can be `null` or absent
	`evalcards.annotations.{reproducibility_gap,provenance,variant_divergence,cross_party_divergence}` can each be `null` independently, and the entire `evalcards` block may be absent on older cached snapshots. Use optional chaining everywhere; never assume presence. The `RowAnnotations` type intentionally types each subfield as `T \| null` (not `T \| undefined`) because the backend writes explicit `null`.

	---

	## 7. Suggested implementation order

	1. Types + plumbing (1–2 hours): types in `backend-artifacts.ts` + `hf-data.ts`, the `fetchCorpusAggregates` fetcher, the API route, and adding `corpus-aggregates.json` to the cache script. No UI yet.
	2. Row-level badges (½ day): build `signals/` directory with the four badge components, the dedup-aware `signals-row-badges.tsx`, and wire into eval-detail and benchmark-detail. This is the most visible win.
	3. Per-eval completeness panel + comparability panel (½ day): single benchmark, easy to design around. New `CompletenessPanel` is the headline new UX in this set.
	4. Per-row reproducibility detail panel (1–2 hours): drops into the existing expanded row layout.
	5. Per-eval / per-model header chips + replace the hand-written gap hint (1–2 hours): wires the summary fields into existing card surfaces.
	6. Corpus dashboard page (1–2 days): new route, new components, biggest scope. Defer until 1–5 are live and reviewed.

	Each step is independently shippable. Steps 1–5 can land before the corpus dashboard is designed.

	---

	## 8. Out of scope (don't do these yet)

	- Filter / sort the eval list by signal state ("show only benchmarks with completeness > 0.5"). Wait for the dashboard view to land first; users will tell us which filters they actually want.
	- Side-by-side score comparison with divergence overlay. The data supports it (`scores_in_group`, `scores_by_organization`) but the design space is large. Hold off until we see the row-level badges in use.
	- Recompute / verification UI for missing reproducibility fields. Backend-side; out of scope here.
	- Per-instance sample-level badges. Signals operate at row / benchmark level; sample-level instance data is unaffected.

	---

	## 9. Reference: minimal real-shape examples

	Per-row `evalcards.annotations` with all four signals populated:

	```jsonc
	{
	"reproducibility_gap": {
	"has_reproducibility_gap": true,
	"missing_fields": ["max_tokens"],
	"required_field_count": 2,
	"populated_field_count": 1,
	"signal_version": "1.0"
	},
	"provenance": {
	"source_type": "first_party",
	"is_multi_source": false,
	"first_party_only": true,
	"distinct_reporting_organizations": 1,
	"signal_version": "1.0"
	},
	"variant_divergence": null,
	"cross_party_divergence": null
	}
	```

	Per-eval `evalcards.annotations` with completeness + comparability:

	```jsonc
	{
	"reporting_completeness": {
	"completeness_score": 0.62,
	"total_fields_evaluated": 28,
	"missing_required_fields": [
	"autobenchmarkcard.methodology.baseline_results",
	"autobenchmarkcard.methodology.validation",
	"evalcards.preregistration_url"
	],
	"partial_fields": [
	{ "field_path": "autobenchmarkcard.data", "score": 0.5, "populated_subitems": 2, "total_subitems": 4 }
	],
	"field_scores": [/* 28 entries */],
	"signal_version": "1.0"
	},
	"benchmark_comparability": {
	"variant_divergence_groups": [
	{
	"group_id": "openai__gpt-5__hfopenllm_v2_bbh_accuracy",
	"model_route_id": "openai__gpt-5",
	"divergence_magnitude": 0.12,
	"threshold_used": 0.05,
	"threshold_basis": "proportion_or_continuous_normalized",
	"differing_setup_fields": [
	{ "field": "max_tokens", "values": [2048, 4096, 8192] }
	]
	}
	],
	"cross_party_divergence_groups": []
	}
	}
	```

	Top-level `provenance_summary` example:

	```jsonc
	{
	"total_results": 142,
	"total_groups": 47,
	"multi_source_groups": 3,
	"first_party_only_groups": 30,
	"source_type_distribution": {
	"first_party": 120,
	"third_party": 18,
	"collaborative": 0,
	"unspecified": 4
	}
	}
	```

	`corpus-aggregates.json` structure (top of file):

	```jsonc
	{
	"generated_at": "2026-04-27T...",
	"signal_version": "1.0",
	"stratification_dimensions": ["category"],
	"reproducibility": { "overall": {/* ReproducibilityCorpusBlock */}, "by_category": { "agentic": {...}, "general": {...}, ... } },
	"completeness": { "overall": {/* CompletenessCorpusBlock */}, "by_category": {...} },
	"provenance": { "overall": {/* ProvenanceCorpusBlock */}, "by_category": {...} },
	"comparability": { "overall": {/* ComparabilityCorpusBlock */}, "by_category": {...} }
	}
	```

	---

	## 10. Audience-mode wording cheatsheet

	\| Element \| Research mode \| Policy mode \|
	\|---\|---\|---\|
	\| Reproducibility gap badge \| "Reproducibility gap" \| "Setup not documented" \|
	\| Reproducibility tooltip \| "Setup not fully documented. Missing: {fields}." \| "This score's setup isn't documented, so it can't be re-run as-is." \|
	\| Reproducibility panel title \| "Reproducibility" \| "Re-runnability" \|
	\| Completeness chip label \| "Documentation" \| "Documentation" \|
	\| Completeness panel title \| "Reporting completeness" \| "How well is this benchmark documented?" \|
	\| Provenance: first-party \| "1st party" \| "Reported by model developer" \|
	\| Provenance: first-party only \| "1st party only — no replication" \| "Only the model developer reported this score" \|
	\| Provenance: third-party \| "3rd party" \| "Independently reported" \|
	\| Provenance: collaborative \| "Collaborative" \| "Joint report" \|
	\| Variant divergence badge \| "Variant divergence" \| "Score depends on setup" \|
	\| Variant divergence tooltip \| "Scores diverge by {magnitude} across different setups: {fields}." \| "Different runs of this evaluation produced different scores — the setup matters." \|
	\| Cross-party divergence badge \| "Cross-party divergence" \| "Sources disagree" \|
	\| Cross-party divergence tooltip \| "Reports diverge by {magnitude} across organizations." \| "Different organizations reported different scores for this same model on this same benchmark." \|

	Adjust tone but keep the underlying numbers identical across modes — the data is the same, only the framing changes.

	---

	Last updated 2026-04-27. Maintainer: backend pipeline (eval_cards_backend_pipeline), frontend (general-eval-card). Questions on backend semantics → [eval_cards_backend_pipeline#2](https://github.com/evaleval/eval_cards_backend_pipeline/issues/2). Questions on UX → discuss with @anka-evals + frontend team.