| # EvalCards interpretive signals: frontend implementation spec | |
| **Status:** ready to implement. Backend ships in `evaleval/eval_cards_backend_pipeline` PR #1 (merged `b05323c`). All field shapes below are stable and covered by the backend's test suite. | |
| **Companion docs:** | |
| - Spec source of truth: *EvalCards Interpretive Signals v1.0* (Anka Reuel, Stanford). Section refs (§3, §4, …) below point at that doc. | |
| - Open backend questions: [evaleval/eval_cards_backend_pipeline#2](https://github.com/evaleval/eval_cards_backend_pipeline/issues/2). None block frontend work; they may shift wording, not shape. | |
| --- | |
| ## 0. What this PR does at a glance | |
| The backend now annotates evaluation records with four interpretive signals: | |
| 1. **Reproducibility gap** (*per row*). Was the evaluation documented well enough to be re-run? Surfaced as a missing-fields list (e.g. "missing `max_tokens`"). | |
| 2. **Reporting completeness** (*per benchmark*). What fraction of EvalCards-required documentation fields are populated? Surfaced as a `[0, 1]` score with a missing-field breakdown. | |
| 3. **Provenance** (*per row*). Who reported this score (first-party / third-party / collaborative / unspecified), and is it the only source for this `(model, benchmark, metric)` group? | |
| 4. **Comparability** (*per `(model, benchmark, metric)` group*). Two flavors: **variant divergence** (same model, same benchmark, different setups → diverging scores) and **cross-party divergence** (different orgs reporting → diverging scores). | |
| Plus a corpus-level rollup file (`corpus-aggregates.json`) for a stratified analytics page. | |
| The frontend's job: surface these signals **in three places**: row-level badges, per-eval / per-model summary panels, and a corpus dashboard view. | |
| --- | |
| ## 1. Where the new data lives | |
| All fields are new additions to existing artifacts. No artifact is removed or reshaped. | |
| | Artifact | New fields | | |
| |---|---| | |
| | `evals/{id}.json` (`HFEvalDetail`) | Per-row `evalcards.annotations` block on every `metrics[].model_results[]` and `subtasks[…].metrics[].model_results[]`. Plus eval-root `evalcards.annotations.reporting_completeness`, `evalcards.annotations.benchmark_comparability`, and three top-level summaries: `reproducibility_summary`, `provenance_summary`, `comparability_summary`. | |
| | `models/{id}.json` (`HFModelDetail`) | Per-row `evalcards.annotations` block on every `hierarchy_by_category[*][*].metrics[].model_results[]`. Plus three top-level summaries scoped to that model. | | |
| | `eval-list.json` / `eval-list-lite.json` (`HFEvalListEntry`) | Three summaries per entry. | | |
| | `model-cards.json` / `model-cards-lite.json` (`HFModelCardEntry`) | Three summaries per entry. | | |
| | `eval-hierarchy.json` (`EvalHierarchy`) | Each family node and leaf node carries the three summaries (aggregated over evals under it). | | |
| | **`corpus-aggregates.json` (NEW FILE)** | Stratified rollups for paper / dashboard use. | | |
| | `manifest.json` | New entry in `summary_artifacts`: `corpus_aggregates: "corpus-aggregates.json"`. | | |
| `signal_version` (currently `"1.0"`) is present on every annotation. Treat it as opaque; surface only in admin/debug. | |
| --- | |
| ## 2. TypeScript types to add | |
| Add to `lib/backend-artifacts.ts` (preferred; these are pipeline contract types): | |
| ```ts | |
| // Spec §3 | |
| export interface ReproducibilityGap { | |
|   has_reproducibility_gap: boolean | |
|   missing_fields: string[] // e.g. ["max_tokens"] | |
|   required_field_count: number // 2 base + 2 if agentic on current runtime | |
|   populated_field_count: number | |
|   signal_version: string | |
| } | |
| // Spec §5 | |
| export type ProvenanceSourceType = | |
|   | "first_party" | |
|   | "third_party" | |
|   | "collaborative" | |
|   | "unspecified" | |
| export interface Provenance { | |
|   source_type: ProvenanceSourceType | |
|   is_multi_source: boolean | |
|   first_party_only: boolean // see §6.1 below for caveat | |
|   distinct_reporting_organizations: number | |
|   signal_version: string | |
| } | |
| // Spec §6.1 | |
| export interface VariantDivergence { | |
|   has_variant_divergence: boolean | |
|   group_id: string // "{model_route_id}__{metric_summary_id}" | |
|   divergence_magnitude: number | |
|   threshold_used: number | |
|   threshold_basis: | |
|     | "proportion_or_continuous_normalized" | |
|     | "percent" | |
|     | "range_5pct" | |
|     | "fallback_default" | |
|   differing_setup_fields: Array<{ field: string; values: unknown[] }> | |
|   scores_in_group: number[] | |
|   this_triple_score: number | null // this row's score within the group | |
|   triple_count_in_group: number | |
|   score_scale_anomaly: boolean | |
|   group_variant_breakdown: Array<{ variant_key: string; row_count: number }> | |
|   signal_version: string | |
| } | |
| // Spec §6.2 | |
| export interface CrossPartyDivergence { | |
|   has_cross_party_divergence: boolean | |
|   group_id: string | |
|   divergence_magnitude: number | |
|   threshold_used: number | |
|   threshold_basis: VariantDivergence["threshold_basis"] | |
|   scores_by_organization: Record<string, number> // display org name → score | |
|   differing_setup_fields: Array<{ field: string; values: unknown[] }> | |
|   organization_count: number | |
|   group_variant_breakdown: Array<{ variant_key: string; row_count: number }> | |
|   signal_version: string | |
| } | |
| // Per-row annotation block (carried on every model_result row) | |
| export interface RowAnnotations { | |
|   reproducibility_gap: ReproducibilityGap | null | |
|   provenance: Provenance | null | |
|   variant_divergence: VariantDivergence | null | |
|   cross_party_divergence: CrossPartyDivergence | null | |
| } | |
| // Spec §4 | |
| export interface ReportingCompleteness { | |
|   completeness_score: number // [0, 1] | |
|   total_fields_evaluated: number | |
|   missing_required_fields: string[] // dotted paths | |
|   partial_fields: Array<{ | |
|     field_path: string | |
|     score: number // (0, 1), strictly between | |
|     populated_subitems: number | |
|     total_subitems: number | |
|   }> | |
|   field_scores: Array<{ | |
|     field_path: string | |
|     coverage_type: "full" | "partial" | "reserved" | |
|     score: number // [0, 1] | |
|   }> | |
|   signal_version: string | |
| } | |
| export interface BenchmarkComparability { | |
|   variant_divergence_groups: Array<{ | |
|     group_id: string | |
|     model_route_id: string | |
|     divergence_magnitude: number | |
|     threshold_used: number | |
|     threshold_basis: VariantDivergence["threshold_basis"] | |
|     differing_setup_fields: VariantDivergence["differing_setup_fields"] | |
|   }> | |
|   cross_party_divergence_groups: Array<{ | |
|     group_id: string | |
|     model_route_id: string | |
|     divergence_magnitude: number | |
|     threshold_used: number | |
|     threshold_basis: VariantDivergence["threshold_basis"] | |
|     scores_by_organization: Record<string, number> | |
|     differing_setup_fields: VariantDivergence["differing_setup_fields"] | |
|   }> | |
| } | |
| // Eval-root or model-root annotation block | |
| export interface EvalcardsAnnotations { | |
|   reporting_completeness?: ReportingCompleteness | |
|   benchmark_comparability?: BenchmarkComparability | |
| } | |
| // Top-level summary blocks (present on eval-list / model-cards / eval / model / hierarchy nodes) | |
| export interface ReproducibilitySummary { | |
|   results_total: number | |
|   has_reproducibility_gap_count: number | |
|   populated_ratio_avg: number | null // null when results_total == 0 | |
| } | |
| export interface ProvenanceSummary { | |
|   total_results: number | |
|   total_groups: number | |
|   multi_source_groups: number | |
|   first_party_only_groups: number | |
|   source_type_distribution: Record<ProvenanceSourceType, number> | |
| } | |
| export interface ComparabilitySummary { | |
|   total_groups: number | |
|   groups_with_variant_check: number // eligible groups (>=2 rows, differing setups, >=2 scored) | |
|   groups_with_cross_party_check: number // eligible groups (>=2 named orgs) | |
|   variant_divergent_count: number | |
|   cross_party_divergent_count: number | |
| } | |
| export interface SignalSummaries { | |
|   reproducibility_summary?: ReproducibilitySummary | |
|   provenance_summary?: ProvenanceSummary | |
|   comparability_summary?: ComparabilitySummary | |
| } | |
| // corpus-aggregates.json | |
| export interface CorpusAggregates { | |
|   generated_at: string | |
|   signal_version: string | |
|   stratification_dimensions: ["category"] | |
|   reproducibility: Stratified<ReproducibilityCorpusBlock> | |
|   completeness: Stratified<CompletenessCorpusBlock> | |
|   provenance: Stratified<ProvenanceCorpusBlock> | |
|   comparability: Stratified<ComparabilityCorpusBlock> | |
| } | |
| export interface Stratified<T> { | |
|   overall: T | |
|   by_category: Record<string, T> // categories: agentic | general | knowledge | reasoning | safety | other | |
| } | |
| export interface ReproducibilityCorpusBlock { | |
|   total_triples: number | |
|   triples_with_reproducibility_gap: number | |
|   reproducibility_gap_rate: number | null | |
|   agentic_triples: number | |
|   per_field_missingness: Record<string, { | |
|     missing_count: number | |
|     missing_rate: number | null | |
|     denominator: "all_triples" | "agentic_only" | |
|     denominator_count: number | |
|   }> | |
| } | |
| export interface CompletenessCorpusBlock { | |
|   total_benchmarks: number | |
|   completeness_score_mean: number | null | |
|   completeness_score_median: number | null | |
|   per_field_population: Record<string, { | |
|     mean_score: number | |
|     populated_rate: number | |
|     fully_populated_rate: number | |
|     benchmark_count: number | |
|   }> | |
| } | |
| export interface ProvenanceCorpusBlock { | |
|   total_triples: number | |
|   total_groups: number | |
|   multi_source_groups: number | |
|   multi_source_rate: number | null | |
|   first_party_only_groups: number | |
|   first_party_only_rate: number | null | |
|   source_type_distribution: Record<ProvenanceSourceType, number> | |
| } | |
| export interface ComparabilityCorpusBlock { | |
|   total_groups: number | |
|   variant_eligible_groups: number | |
|   variant_divergent_groups: number | |
|   variant_divergence_rate: number | null | |
|   cross_party_eligible_groups: number | |
|   cross_party_divergent_groups: number | |
|   cross_party_divergence_rate: number | null // commonly null on current corpus | |
| } | |
| ``` | |
| Then in `lib/hf-data.ts`: | |
| - Extend `HFEvalModelResult` (line ~522) with `evalcards?: { annotations?: RowAnnotations }`. | |
| - Extend `HFEvalDetail` (line ~556) with `evalcards?: { annotations?: EvalcardsAnnotations }` plus the three summary fields from `SignalSummaries`. | |
| - Extend `HFEvalListEntry` (line ~475) with `SignalSummaries` fields. | |
| - Extend `HFModelCardEntry` (line ~439) with `SignalSummaries` fields. | |
| - Extend `HFModelDetail` (line ~571) with `SignalSummaries` fields. | |
| - Extend `HFModelHierarchyMetric` (line ~616): `model_results` is already typed as `HFEvalModelResult`, so the per-row annotations propagate automatically. | |
| In `EvalHierarchy` types (`lib/backend-artifacts.ts` line ~54), add `SignalSummaries` to both `HierarchyFamily` and `HierarchyBenchmark`. | |
| All fields are **optional** at the type level: older cached snapshots won't have them, and the frontend should render gracefully when they're absent. | |
| --- | |
| ## 3. Data plumbing | |
| ### 3.1 New fetcher + API route for corpus aggregates | |
| In `lib/hf-data.ts`, add after the existing fetchers (~line 866): | |
| ```ts | |
| export async function fetchCorpusAggregates(): Promise<CorpusAggregates | null> { | |
|   return fetchHFJsonSafe<CorpusAggregates>("corpus-aggregates.json") | |
| } | |
| ``` | |
| Add to `scripts/cache-hf-data.mjs` `CACHE_ROOT_FILES` array: `"corpus-aggregates.json"`. (Mark it optional in `OPTIONAL_CACHE_ROOT_FILES` if shipping while the HF dataset upload is still rolling; once the backend pipeline next runs against the dataset, the file will appear.) | |
| Create `app/api/corpus-aggregates/route.ts`: | |
| ```ts | |
| import { NextResponse } from "next/server" | |
| import { fetchCorpusAggregates } from "@/lib/hf-data" | |
| export async function GET() { | |
|   const aggregates = await fetchCorpusAggregates() | |
|   if (!aggregates) { | |
|     return NextResponse.json({ error: "Corpus aggregates not available" }, { status: 404 }) | |
|   } | |
|   return NextResponse.json(aggregates) | |
| } | |
| ``` | |
| ### 3.2 Rest of plumbing is automatic | |
| Existing fetchers (`fetchEvalDetail`, `fetchModelDetail`, `fetchEvalList`, `fetchModelCardsList`, `fetchEvalHierarchy`) just pull the raw JSON, so the new fields propagate without code changes once the types above are widened. | |
| --- | |
| ## 4. UX components to build | |
| Build a small set of reusable signal components in `components/signals/`. Each takes one of the typed shapes above and renders a badge / panel. This keeps signal rendering consistent across `eval-detail.tsx`, `benchmark-detail.tsx`, `model-compare-dialog.tsx`, and the new corpus dashboard. | |
| ``` | |
| components/signals/ | |
| ├── reproducibility-badge.tsx | |
| ├── provenance-badge.tsx // already partially exists in benchmark-detail.tsx (see §4.2) | |
| ├── variant-divergence-badge.tsx | |
| ├── cross-party-divergence-badge.tsx | |
| ├── reproducibility-panel.tsx // detail view: full missing-fields list | |
| ├── completeness-panel.tsx // detail view: score bar + missing-field list | |
| ├── comparability-panel.tsx // detail view: divergent groups list | |
| ├── signals-row-badges.tsx // composite: renders all four row-level badges with proper spacing | |
| └── signal-tooltip.tsx // shared tooltip primitive | |
| ``` | |
| All badges should follow the existing tone conventions used by `getRelationshipBadgeTone` ([components/benchmark-detail.tsx:289](../components/benchmark-detail.tsx#L289)) and the `Badge` primitive in [components/ui/badge.tsx](../components/ui/badge.tsx). | |
| ### 4.1 Row-level badges: placement | |
| Insert `<SignalsRowBadges annotations={modelResult.evalcards?.annotations} />` next to the score cell in: | |
| - **Eval detail leaderboard table**: [components/eval-detail.tsx:869-871](../components/eval-detail.tsx#L869-L871) (the `<TableCell className="text-right">` containing the score). Render badges below the score on a new line for desktop, hidden on mobile. | |
| - **Benchmark detail rows**: `components/benchmark-detail.tsx` renders score rows in several places (search for `formatRawScoreValue`); insert the same component. | |
| - **Model compare dialog**: [components/model-compare-dialog.tsx](../components/model-compare-dialog.tsx) score columns. | |
| **Display rules: only badge for actionable states.** Silence is meaningful here. | |
| | Signal | Show badge when | Hide when | | |
| |---|---|---| | |
| | Reproducibility | `has_reproducibility_gap === true` | gap=false, or annotation absent | | |
| | Provenance | `source_type` ∈ {`first_party`, `third_party`, `collaborative`} | `source_type === "unspecified"` | | |
| | Variant divergence | `variant_divergence !== null && has_variant_divergence === true` | null (not applicable) or false (checked, fine) | | |
| | Cross-party divergence | `cross_party_divergence !== null && has_cross_party_divergence === true` | null (almost always on current corpus) or false | | |
| `has_*: false` means "we checked and it's fine" (silent success). `null` means "not applicable / not enough data" (also silent). **Only divergent / gap-positive states warrant pixels.** | |
| **Dedup rule.** `variant_divergence` and `cross_party_divergence` are duplicated onto every row in the same group. If you render three rows from the same `group_id`, render the divergence badge on each row but the *expanded panel* (§4.4) only once at the group header. | |
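
The display and dedup rules above can be sketched as two pure helpers that a `SignalsRowBadges` component might call. This is a minimal sketch, not the intended implementation: `RowAnnotationsLite`, `visibleBadges`, and `divergentGroupIds` are hypothetical names, the shapes are trimmed copies of the section 2 types, and it assumes one expanded panel per distinct `group_id`.

```typescript
// Trimmed local copies of the section 2 shapes (only the fields read here).
type SourceType = "first_party" | "third_party" | "collaborative" | "unspecified"

interface RowAnnotationsLite {
  reproducibility_gap: { has_reproducibility_gap: boolean } | null
  provenance: { source_type: SourceType } | null
  variant_divergence: { has_variant_divergence: boolean; group_id: string } | null
  cross_party_divergence: { has_cross_party_divergence: boolean; group_id: string } | null
}

type BadgeKind =
  | "reproducibility"
  | "provenance"
  | "variant_divergence"
  | "cross_party_divergence"

// Hypothetical helper: which badges to draw for one row.
function visibleBadges(a: RowAnnotationsLite | null | undefined): BadgeKind[] {
  const badges: BadgeKind[] = []
  if (!a) return badges // annotation block absent: stay silent
  if (a.reproducibility_gap && a.reproducibility_gap.has_reproducibility_gap) {
    badges.push("reproducibility")
  }
  if (a.provenance && a.provenance.source_type !== "unspecified") {
    badges.push("provenance")
  }
  if (a.variant_divergence && a.variant_divergence.has_variant_divergence) {
    badges.push("variant_divergence")
  }
  if (a.cross_party_divergence && a.cross_party_divergence.has_cross_party_divergence) {
    badges.push("cross_party_divergence")
  }
  return badges
}

// Hypothetical helper for the dedup rule: distinct divergent group ids in
// first-seen order, so the expanded panel can be mounted once per group.
function divergentGroupIds(rows: Array<RowAnnotationsLite | null | undefined>): string[] {
  const seen: Record<string, boolean> = {}
  const ids: string[] = []
  for (const a of rows) {
    if (!a) continue
    const candidates: string[] = []
    const v = a.variant_divergence
    const c = a.cross_party_divergence
    if (v && v.has_variant_divergence) candidates.push(v.group_id)
    if (c && c.has_cross_party_divergence) candidates.push(c.group_id)
    for (const id of candidates) {
      if (!seen[id]) {
        seen[id] = true
        ids.push(id)
      }
    }
  }
  return ids
}
```

Keeping the decision logic outside the component makes the "silence is meaningful" rule unit-testable without rendering anything.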
| ### 4.2 Provenance badge: reuse what's there | |
| [components/benchmark-detail.tsx:262-302](../components/benchmark-detail.tsx#L262-L302) already has `getRelationshipShortLabel` and `getRelationshipBadgeTone`. Extract these into `components/signals/provenance-badge.tsx` and import back into `benchmark-detail.tsx`. The new badge should **also** consume the new `Provenance` annotation when present (it carries `is_multi_source` and `first_party_only`, which the current implementation derives row-by-row from `source_metadata` alone). | |
| When `provenance.first_party_only === true`, show a small, subtle indicator on the first-party badge ("first-party only: no independent replication"). This is the headline use of the signal for policy-mode readers. | |
| ### 4.3 Reproducibility badge: content rules | |
| Tooltip content depends on audience mode (`useAudienceMode()` from [components/audience-mode-provider.tsx:40](../components/audience-mode-provider.tsx#L40)): | |
| - Research mode: "Setup not fully documented. Missing: `max_tokens`, `eval_plan`." | |
| - Policy mode: "This score's setup isn't fully documented, so it can't be re-run as-is." | |
| Always include the count "{populated_field_count} of {required_field_count} setup fields recorded." Don't hardcode "4 fields": the active runtime checks 2 base fields (`temperature`, `max_tokens`) plus 2 agentic fields (`eval_plan`, `eval_limits`) when the benchmark is agentic. Read counts off the annotation. | |
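
As a rough sketch of these content rules, the tooltip string could be assembled as below. `reproducibilityTooltip` and the `AudienceMode` values are assumptions for illustration (the actual return shape of `useAudienceMode()` is not shown in this spec), and the markdown backticks around field names are dropped because the output is plain text.

```typescript
type AudienceMode = "research" | "policy" // assumed mode names

// Trimmed copy of ReproducibilityGap from section 2.
interface ReproducibilityGapLite {
  has_reproducibility_gap: boolean
  missing_fields: string[]
  required_field_count: number
  populated_field_count: number
}

// Hypothetical helper: builds the tooltip text for one row.
function reproducibilityTooltip(gap: ReproducibilityGapLite, mode: AudienceMode): string {
  // Counts always come from the annotation, never from a hardcoded "4".
  const count = `${gap.populated_field_count} of ${gap.required_field_count} setup fields recorded.`
  let lead: string
  if (mode === "research") {
    lead = gap.missing_fields.length > 0
      ? `Setup not fully documented. Missing: ${gap.missing_fields.join(", ")}.`
      : "Setup not fully documented."
  } else {
    lead = "This score's setup isn't fully documented, so it can't be re-run as-is."
  }
  return `${lead} ${count}`
}
```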
| ### 4.4 Detail panels: placement | |
| #### Reproducibility panel | |
| The existing "Evaluation Provenance" panel in [components/eval-detail.tsx:952-998](../components/eval-detail.tsx#L952-L998) (rendered when a row is expanded) is the right place for the **per-row** reproducibility breakdown. Add a new `DetailPanel` adjacent to it: | |
| ```tsx | |
| {rowAnnotations?.reproducibility_gap && ( | |
|   <DetailPanel | |
|     title={isResearchView ? "Reproducibility" : "Re-runnability"} | |
|     subtitle={ | |
|       isResearchView | |
|         ? "Whether the setup is documented well enough for someone else to re-run." | |
|         : "Whether someone could re-run this evaluation with the information available." | |
|     } | |
|   > | |
|     <MetaRow | |
|       label="Setup fields recorded" | |
|       value={`${rowAnnotations.reproducibility_gap.populated_field_count} of ${rowAnnotations.reproducibility_gap.required_field_count}`} | |
|     /> | |
|     {rowAnnotations.reproducibility_gap.missing_fields.length > 0 && ( | |
|       <MetaRow | |
|         label="Missing" | |
|         value={rowAnnotations.reproducibility_gap.missing_fields.join(", ")} | |
|       /> | |
|     )} | |
|   </DetailPanel> | |
| )} | |
| ``` | |
| #### Completeness panel | |
| Render at the **eval-detail header level** (above the leaderboard, below the metric specification card). New `<CompletenessPanel completeness={detail.evalcards?.annotations?.reporting_completeness} />`. UI: progress bar showing `completeness_score`, label "{N} of {M} fields populated" where N = sum of `field_scores[].score` rounded, M = `total_fields_evaluated`. Below: collapsible accordions: | |
| - **Missing required fields** (count badge): list of `missing_required_fields` with friendly labels (see §6.3 for label mapping). | |
| - **Partially populated** (count badge): `partial_fields` rendered as "{field}: {populated_subitems}/{total_subitems}". | |
| In policy mode, don't show the dotted-path field names; show friendly labels only. In research mode, show both. | |
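
The headline numbers for the completeness panel could be derived roughly as follows. This is a sketch under the rules stated above (N is the rounded sum of `field_scores[].score`, M is `total_fields_evaluated`); `completenessHeadline` and `partialFieldLine` are hypothetical helper names, and rounding uses `Math.round`.

```typescript
// Trimmed copy of ReportingCompleteness from section 2.
interface ReportingCompletenessLite {
  completeness_score: number
  total_fields_evaluated: number
  field_scores: Array<{ score: number }>
  partial_fields: Array<{ field_path: string; populated_subitems: number; total_subitems: number }>
}

// Hypothetical helper: progress-bar percentage plus the "{N} of {M}" label.
function completenessHeadline(c: ReportingCompletenessLite): { percent: number; label: string } {
  const populated = Math.round(c.field_scores.reduce((sum, f) => sum + f.score, 0))
  return {
    percent: Math.round(c.completeness_score * 100),
    label: `${populated} of ${c.total_fields_evaluated} fields populated`,
  }
}

// Hypothetical helper: one line of the "Partially populated" accordion.
// `displayName` is whatever label the caller chose for the current audience mode.
function partialFieldLine(
  displayName: string,
  p: { populated_subitems: number; total_subitems: number },
): string {
  return `${displayName}: ${p.populated_subitems}/${p.total_subitems}`
}
```

The same `percent` value can feed the "Documentation" header chip so the chip and the panel never disagree.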
| #### Comparability panel | |
| Also at eval-detail header level. Sourced from `detail.evalcards?.annotations?.benchmark_comparability`. Render as two collapsibles: "Variant divergence ({count})" and "Cross-party divergence ({count})". Each item should link to the relevant model row (use `model_route_id` from each group entry as anchor; add `id={"row-" + model_route_id}` on the leaderboard row). | |
| When both arrays are empty, hide the panel entirely. When `comparability_summary.groups_with_cross_party_check === 0` (the common state), surface a small note: "No third-party reports available for cross-party comparison." | |
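
One way to read the visibility rules for this panel is sketched below. It assumes the "no third-party reports" note is shown only while the panel itself is visible, which the text above does not state explicitly; `comparabilityPanelState` and `rowAnchorId` are hypothetical names.

```typescript
// Trimmed copies of BenchmarkComparability and ComparabilitySummary from section 2.
interface BenchmarkComparabilityLite {
  variant_divergence_groups: Array<{ group_id: string; model_route_id: string }>
  cross_party_divergence_groups: Array<{ group_id: string; model_route_id: string }>
}
interface ComparabilitySummaryLite {
  groups_with_cross_party_check: number
}

interface ComparabilityPanelState {
  visible: boolean
  variantCount: number
  crossPartyCount: number
  showNoThirdPartyNote: boolean
}

// Anchor id shared by the leaderboard row and the panel links.
function rowAnchorId(modelRouteId: string): string {
  return "row-" + modelRouteId
}

// Hypothetical helper: decides what the comparability panel shows.
function comparabilityPanelState(
  comparability: BenchmarkComparabilityLite | null | undefined,
  summary: ComparabilitySummaryLite | null | undefined,
): ComparabilityPanelState {
  const variantCount = comparability ? comparability.variant_divergence_groups.length : 0
  const crossPartyCount = comparability ? comparability.cross_party_divergence_groups.length : 0
  const visible = variantCount + crossPartyCount > 0 // both arrays empty: hide entirely
  // Assumption: the note only makes sense inside a visible panel.
  const showNoThirdPartyNote =
    visible && !!summary && summary.groups_with_cross_party_check === 0
  return { visible, variantCount, crossPartyCount, showNoThirdPartyNote }
}
```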
| ### 4.5 Per-eval header chips | |
| On the eval-detail page header (next to existing "Measures" / "Source dataset" chips around [components/eval-detail.tsx:486-525](../components/eval-detail.tsx#L486-L525)), add a fourth chip when `evalcards.annotations.reporting_completeness` is present: | |
| > **Documentation** | |
| > {round(completeness_score * 100)}% | |
| Tooltip: "{N} of {M} EvalCards documentation fields populated for this benchmark." | |
| ### 4.6 Per-model card chips | |
| On `components/eval-card.tsx` and the model card pages, add three chips driven by the model-level summaries. Replace the hand-written hint at [components/eval-card.tsx:250](../components/eval-card.tsx#L250) ("Some results lack generation settings; compare scores with care.") with a data-driven version: | |
| > {has_reproducibility_gap_count} of {results_total} reported scores aren't fully documented. | |
| Show only when `has_reproducibility_gap_count > 0`. The hand-written hint was a placeholder for exactly this signal, so wire it up. | |
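
A minimal sketch of the data-driven hint, assuming the card receives the model-level `reproducibility_summary` (possibly absent on older snapshots). `reproducibilityGapHint` is a hypothetical name; returning `null` means "render nothing".

```typescript
// Trimmed copy of ReproducibilitySummary from section 2.
interface ReproducibilitySummaryLite {
  results_total: number
  has_reproducibility_gap_count: number
}

// Hypothetical helper: hint text, or null when the hint should be hidden.
function reproducibilityGapHint(
  summary: ReproducibilitySummaryLite | null | undefined,
): string | null {
  if (!summary || summary.has_reproducibility_gap_count <= 0) return null
  return `${summary.has_reproducibility_gap_count} of ${summary.results_total} reported scores aren't fully documented.`
}
```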
| --- | |
| ## 5. New page: corpus dashboard | |
| Add `app/corpus/page.tsx` (linked from main navigation [components/navigation.tsx](../components/navigation.tsx)). Server component that calls `fetchCorpusAggregates()` and renders four sections: | |
| ### 5.1 Reproducibility section | |
| - Headline number: `reproducibility_gap_rate` rendered as percentage. Sub-label: "{triples_with_reproducibility_gap} of {total_triples} reported scores." | |
| - Per-field horizontal bar chart from `per_field_missingness`. **Bar denominator depends on `denominator` field**: agentic-only fields use `agentic_triples`, others use `total_triples`. Label each bar with the denominator type so users understand. | |
| - Toggle: `overall` ↔ `by_category` (rendered as a small-multiple grid, one panel per category). | |
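
The per-field bars described above could be prepared with a small mapper like the one below. It is a sketch that assumes `denominator_count` already carries the right base (`agentic_triples` or `total_triples`) for each field; `missingnessBars` and the label wording are illustrative, not prescribed.

```typescript
// Trimmed copy of one per_field_missingness entry from section 2.
interface FieldMissingness {
  missing_count: number
  missing_rate: number | null
  denominator: "all_triples" | "agentic_only"
  denominator_count: number
}

interface MissingnessBar {
  field: string
  rate: number | null // null stays null; the chart should show an empty state
  label: string
}

// Hypothetical helper: one bar per field, sorted by field name for stable output.
function missingnessBars(perField: Record<string, FieldMissingness>): MissingnessBar[] {
  return Object.keys(perField)
    .sort()
    .map((field) => {
      const m = perField[field]
      const basis = m.denominator === "agentic_only" ? "agentic scores" : "all scores"
      return {
        field,
        rate: m.missing_rate,
        label: `${m.missing_count} of ${m.denominator_count} ${basis}`,
      }
    })
}
```

The same function works for `overall` and for each `by_category` block, so the toggle only changes which block is passed in.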
| ### 5.2 Completeness section | |
| - Headline: `completeness_score_mean` (and median) across `total_benchmarks`. | |
| - Histogram of per-benchmark scores (pull individual benchmark scores from `eval-list.json` `reporting_completeness.completeness_score`, since corpus-aggregates only carries mean/median). | |
| - Per-field bar chart from `per_field_population`: three bars per field (`mean_score`, `populated_rate`, `fully_populated_rate`). (See §6.7 for which one to highlight per coverage type.) | |
| ### 5.3 Provenance section | |
| - Stacked bar of `source_type_distribution` (across all triples). | |
| - Two ratios: `multi_source_rate`, `first_party_only_rate`. Label both: "% of (model, benchmark, metric) groups." | |
| ### 5.4 Comparability section | |
| - Two side-by-side panels: Variant divergence (eligible-aware rate) and Cross-party divergence (often null). | |
| - **When `cross_party_divergence_rate === null`:** show a "Not enough multi-org coverage to compute" empty state, not "0%". Same for `variant_divergence_rate === null`. This is critical; see §6.5. | |
| All sections support a category toggle (research mode shows category breakdowns by default; policy mode shows overall by default). | |
| --- | |
| ## 6. Caveats and edge cases (read these before implementing) | |
| ### 6.1 `first_party_only` semantics | |
| A row can be `first_party_only: true` even when `is_multi_source: false`. The spec literal: a group with one *named* org reporting first-party gets the badge. **Don't read it as "exclusive coverage"**; read it as "no independent replication." The label suggestion is "First-party only" rather than "Sole source." | |
| If `distinct_reporting_organizations === 0` (all rows have null org), `first_party_only` is `false` even when `source_type === "first_party"`. Render the row's source as "First-party (org unspecified)" in research mode; suppress the first-party-only badge. | |
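
These semantics can be folded into a single label helper, sketched below. The base labels are taken from the wording cheatsheet in section 10; `provenanceBadge`, `BASE_LABELS`, and the `AudienceMode` values are hypothetical, and the policy-mode label for the "org unspecified" case is an assumption because the text only specifies research mode.

```typescript
type AudienceMode = "research" | "policy" // assumed mode names
type SourceType = "first_party" | "third_party" | "collaborative" | "unspecified"
type NamedSourceType = "first_party" | "third_party" | "collaborative"

// Trimmed copy of Provenance from section 2.
interface ProvenanceLite {
  source_type: SourceType
  first_party_only: boolean
  distinct_reporting_organizations: number
}

interface ProvenanceBadge {
  label: string
  showFirstPartyOnly: boolean
}

const BASE_LABELS: Record<AudienceMode, Record<NamedSourceType, string>> = {
  research: { first_party: "1st party", third_party: "3rd party", collaborative: "Collaborative" },
  policy: {
    first_party: "Reported by model developer",
    third_party: "Independently reported",
    collaborative: "Joint report",
  },
}

// Hypothetical helper: null means "no badge" (unspecified or absent provenance).
function provenanceBadge(
  p: ProvenanceLite | null | undefined,
  mode: AudienceMode,
): ProvenanceBadge | null {
  if (!p) return null
  const source = p.source_type
  if (source === "unspecified") return null
  const orgUnknown = p.distinct_reporting_organizations === 0
  if (source === "first_party" && orgUnknown && mode === "research") {
    return { label: "First-party (org unspecified)", showFirstPartyOnly: false }
  }
  return {
    label: BASE_LABELS[mode][source],
    // Defensive: never show the indicator when no org is named.
    showFirstPartyOnly: source === "first_party" && p.first_party_only && !orgUnknown,
  }
}
```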
| ### 6.2 Active reproducibility field set is reduced | |
| The spec describes four base fields (`temperature`, `top_p`, `max_tokens`, `prompt_template`); the active backend currently checks **only `temperature` and `max_tokens`** plus `eval_plan` / `eval_limits` for agentic benchmarks. **Don't hardcode "4 fields" anywhere.** Always read `required_field_count` off the annotation. This is a deliberate spec-author choice and may revert; the field count is the only stable interface. | |
| ### 6.3 Missing-field path strings | |
| `missing_fields` for reproducibility uses bare names (e.g. `"max_tokens"`). `missing_required_fields` for completeness uses dotted paths (e.g. `"autobenchmarkcard.methodology.baseline_results"`). Different conventions, intentional. Build a small label map for completeness paths; paths come from [registry/completeness_fields.json](https://github.com/evaleval/eval_cards_backend_pipeline/blob/main/registry/completeness_fields.json) on the backend repo. Suggested label rules: | |
| - Drop the `autobenchmarkcard.` / `eee_eval.` / `evalcards.` prefix. | |
| - Replace dots with " / ", underscore with space, title-case. | |
| - Example: `autobenchmarkcard.methodology.baseline_results` → "Methodology / Baseline results". | |
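
The suggested label rules can be sketched as a fallback formatter for paths that have no hand-written label. `friendlyFieldLabel` is a hypothetical name; the sketch capitalizes only the first letter of each path segment, which is what the example above shows ("Baseline results"), rather than every word.

```typescript
const KNOWN_PREFIXES = ["autobenchmarkcard.", "eee_eval.", "evalcards."]

// Hypothetical helper: dotted completeness path -> human-readable label.
function friendlyFieldLabel(path: string): string {
  let rest = path
  for (const prefix of KNOWN_PREFIXES) {
    if (rest.indexOf(prefix) === 0) {
      rest = rest.slice(prefix.length) // drop at most one leading namespace
      break
    }
  }
  return rest
    .split(".")
    .map((segment) => {
      const words = segment.replace(/_/g, " ")
      return words.charAt(0).toUpperCase() + words.slice(1)
    })
    .join(" / ")
}
```

A hand-maintained map keyed by path can take precedence over this fallback for fields whose generated label reads poorly.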
| ### 6.4 `differing_setup_fields[].values` may contain null and mixed types | |
| Per spec §6.1.4, `null` is a *distinct* value from any explicit setting (comparing "explicit 2048" to "unspecified" is meaningful). Render `null` as "(unspecified)" rather than the string "null". Numeric, string, boolean, and object values can all appear in the same array; render with `JSON.stringify` for objects, plain text otherwise. | |
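
A small formatter along these lines would cover the mixed-type case. `formatSetupValue` is a hypothetical name, and treating `undefined` the same as `null` is an extra assumption (JSON payloads should never contain it).

```typescript
// Hypothetical helper: renders one entry of differing_setup_fields[].values.
function formatSetupValue(value: unknown): string {
  // Assumption: undefined is treated like null, although the backend writes explicit null.
  if (value === null || value === undefined) return "(unspecified)"
  if (typeof value === "object") return JSON.stringify(value) // objects and arrays
  return String(value) // numbers, strings, booleans
}

// Hypothetical helper: renders the whole values array for a tooltip or panel row.
function formatSetupValues(values: unknown[]): string {
  return values.map(formatSetupValue).join(", ")
}
```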
| ### 6.5 `null` rates in comparability are *not* zero | |
| Eligibility-aware denominators mean `variant_divergence_rate` and `cross_party_divergence_rate` are `null` when no groups were eligible. **Render as "N/A (not enough data)" or an empty-state card, never as "0%".** On the current corpus, `cross_party_divergence_rate` will commonly be null (third-party reports are sparse). Treat this as a normal state, not a data-loading error. | |
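
The null-versus-zero distinction is easy to lose in a generic percent formatter, so a dedicated helper may be worth having. A minimal sketch, with `formatRate` as a hypothetical name and whole-percent rounding as an assumption:

```typescript
// Hypothetical helper: formats a nullable rate for display.
function formatRate(rate: number | null | undefined): string {
  // null means "no eligible groups", which is different from a measured 0%.
  if (rate === null || rate === undefined) return "N/A (not enough data)"
  return `${Math.round(rate * 100)}%`
}
```

Note that a plain truthiness check (`rate ? … : "N/A"`) would wrongly turn a genuine `0` into "N/A", which is why the comparison against `null` is explicit.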
| ### 6.6 Score-scale anomaly flag | |
| `variant_divergence.score_scale_anomaly === true` indicates the metric was declared `proportion` but scores fell outside [0, 1], usually a metric-normalization bug upstream. Surface as a small "data quality warning" annotation alongside the divergence number; the divergence is still computed but the threshold may not be apples-to-apples. | |
| ### 6.7 `mean_score` vs `populated_rate` for completeness | |
| Per-field aggregates expose three numbers. Pick which to display based on `coverage_type`: | |
| - **`full` and `reserved` fields**: `mean_score` and `populated_rate` are equal. Show one number labeled "% of benchmarks populating this field." | |
| - **`partial` fields**: they diverge. `populated_rate` = % of benchmarks with *any* sub-item; `mean_score` = average sub-item population fraction. Show both: "{populated_rate}% have any data, {mean_score}% on average across sub-items." | |
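
The choice between the two presentations can be sketched as below. The `per_field_population` entries do not carry `coverage_type` themselves, so this sketch assumes the caller looks it up elsewhere (for example from a benchmark's `field_scores`); `describeFieldPopulation` is a hypothetical name.

```typescript
type CoverageType = "full" | "partial" | "reserved"

// Trimmed copy of one per_field_population entry from section 2.
interface FieldPopulation {
  mean_score: number
  populated_rate: number
}

function toPercent(x: number): number {
  return Math.round(x * 100)
}

// Hypothetical helper: caption for one field's bar on the corpus dashboard.
function describeFieldPopulation(coverage: CoverageType, stats: FieldPopulation): string {
  if (coverage === "partial") {
    return `${toPercent(stats.populated_rate)}% have any data, ${toPercent(stats.mean_score)}% on average across sub-items`
  }
  // full and reserved: mean_score equals populated_rate, so show one number.
  return `${toPercent(stats.populated_rate)}% of benchmarks populating this field`
}
```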
| ### 6.8 No `computed_at` on per-record annotations | |
| Only `signal_version` is on each annotation. For "last computed" UI text, use `manifest.json → generated_at` from the existing `BackendManifest`. | |
| ### 6.9 Stratification categories | |
| `by_category` keys are: `agentic`, `general`, `knowledge`, `reasoning`, `safety`, `other`. Same set as the existing `category` field on evals; reuse whatever color scheme is currently keyed off `inferCategoryFromBenchmark` ([lib/benchmark-schema.ts](../lib/benchmark-schema.ts)). | |
| ### 6.10 Annotation block can be `null` or absent | |
| `evalcards.annotations.{reproducibility_gap,provenance,variant_divergence,cross_party_divergence}` can each be `null` independently, and the entire `evalcards` block may be absent on older cached snapshots. Use optional chaining everywhere; never assume presence. The `RowAnnotations` type intentionally types each subfield as `T | null` (not `T | undefined`) because the backend writes explicit `null`. | |
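
To make the absent / `null` / `false` / `true` distinction concrete, a four-state reader for one signal might look like the sketch below. `SignalState` and `variantDivergenceState` are hypothetical names; the same pattern would apply to the other three annotations.

```typescript
type SignalState = "absent" | "not_applicable" | "clear" | "flagged"

// Trimmed row shape: every level may be missing on older cached snapshots.
interface ResultRowLite {
  evalcards?: {
    annotations?: {
      variant_divergence?: { has_variant_divergence: boolean } | null
    } | null
  } | null
}

// Hypothetical helper: classifies the variant-divergence signal for one row.
function variantDivergenceState(row: ResultRowLite | null | undefined): SignalState {
  const annotations = row?.evalcards?.annotations
  if (annotations === undefined || annotations === null) return "absent"
  const v = annotations.variant_divergence
  if (v === undefined) return "absent" // older snapshot without this key
  if (v === null) return "not_applicable" // backend wrote an explicit null
  return v.has_variant_divergence ? "flagged" : "clear"
}
```

Only `"flagged"` produces a badge under the display rules in §4.1; the other three states differ only in what an admin or debug view might report.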
| --- | |
| ## 7. Suggested implementation order | |
| 1. **Types + plumbing** (1–2 hours): types in `backend-artifacts.ts` + `hf-data.ts`, the `fetchCorpusAggregates` fetcher, the API route, and adding `corpus-aggregates.json` to the cache script. No UI yet. | |
| 2. **Row-level badges** (½ day): build `signals/` directory with the four badge components, the dedup-aware `signals-row-badges.tsx`, and wire into eval-detail and benchmark-detail. This is the most visible win. | |
| 3. **Per-eval completeness panel + comparability panel** (½ day): single benchmark, easy to design around. New `CompletenessPanel` is the headline new UX in this set. | |
| 4. **Per-row reproducibility detail panel** (1–2 hours): drops into the existing expanded row layout. | |
| 5. **Per-eval / per-model header chips + replace the hand-written gap hint** (1–2 hours): wires the summary fields into existing card surfaces. | |
| 6. **Corpus dashboard page** (1–2 days): new route, new components, biggest scope. Defer until 1–5 are live and reviewed. | |
| Each step is independently shippable. Steps 1–5 can land before the corpus dashboard is designed. | |
| --- | |
| ## 8. Out of scope (don't do these yet) | |
| - **Filter / sort the eval list by signal state** ("show only benchmarks with completeness > 0.5"). Wait for the dashboard view to land first; users will tell us which filters they actually want. | |
| - **Side-by-side score comparison with divergence overlay.** The data supports it (`scores_in_group`, `scores_by_organization`) but the design space is large. Hold off until we see the row-level badges in use. | |
| - **Recompute / verification UI for missing reproducibility fields.** Backend-side; out of scope here. | |
| - **Per-instance sample-level badges.** Signals operate at row / benchmark level; sample-level instance data is unaffected. | |
| --- | |
| ## 9. Reference: minimal real-shape examples | |
| Per-row `evalcards.annotations` with all four signals populated: | |
| ```jsonc | |
| { | |
|   "reproducibility_gap": { | |
|     "has_reproducibility_gap": true, | |
|     "missing_fields": ["max_tokens"], | |
|     "required_field_count": 2, | |
|     "populated_field_count": 1, | |
|     "signal_version": "1.0" | |
|   }, | |
|   "provenance": { | |
|     "source_type": "first_party", | |
|     "is_multi_source": false, | |
|     "first_party_only": true, | |
|     "distinct_reporting_organizations": 1, | |
|     "signal_version": "1.0" | |
|   }, | |
|   "variant_divergence": null, | |
|   "cross_party_divergence": null | |
| } | |
| ``` | |
| Per-eval `evalcards.annotations` with completeness + comparability: | |
| ```jsonc | |
| { | |
|   "reporting_completeness": { | |
|     "completeness_score": 0.62, | |
|     "total_fields_evaluated": 28, | |
|     "missing_required_fields": [ | |
|       "autobenchmarkcard.methodology.baseline_results", | |
|       "autobenchmarkcard.methodology.validation", | |
|       "evalcards.preregistration_url" | |
|     ], | |
|     "partial_fields": [ | |
|       { "field_path": "autobenchmarkcard.data", "score": 0.5, "populated_subitems": 2, "total_subitems": 4 } | |
|     ], | |
|     "field_scores": [/* 28 entries */], | |
|     "signal_version": "1.0" | |
|   }, | |
|   "benchmark_comparability": { | |
|     "variant_divergence_groups": [ | |
|       { | |
|         "group_id": "openai__gpt-5__hfopenllm_v2_bbh_accuracy", | |
|         "model_route_id": "openai__gpt-5", | |
|         "divergence_magnitude": 0.12, | |
|         "threshold_used": 0.05, | |
|         "threshold_basis": "proportion_or_continuous_normalized", | |
|         "differing_setup_fields": [ | |
|           { "field": "max_tokens", "values": [2048, 4096, 8192] } | |
|         ] | |
|       } | |
|     ], | |
|     "cross_party_divergence_groups": [] | |
|   } | |
| } | |
| ``` | |
| Top-level `provenance_summary` example: | |
| ```jsonc | |
| { | |
|   "total_results": 142, | |
|   "total_groups": 47, | |
|   "multi_source_groups": 3, | |
|   "first_party_only_groups": 30, | |
|   "source_type_distribution": { | |
|     "first_party": 120, | |
|     "third_party": 18, | |
|     "collaborative": 0, | |
|     "unspecified": 4 | |
|   } | |
| } | |
| ``` | |
| `corpus-aggregates.json` structure (top of file): | |
| ```jsonc | |
| { | |
|   "generated_at": "2026-04-27T...", | |
|   "signal_version": "1.0", | |
|   "stratification_dimensions": ["category"], | |
|   "reproducibility": { "overall": {/* ReproducibilityCorpusBlock */}, "by_category": { "agentic": {...}, "general": {...}, ... } }, | |
|   "completeness": { "overall": {/* CompletenessCorpusBlock */}, "by_category": {...} }, | |
|   "provenance": { "overall": {/* ProvenanceCorpusBlock */}, "by_category": {...} }, | |
|   "comparability": { "overall": {/* ComparabilityCorpusBlock */}, "by_category": {...} } | |
| } | |
| ``` | |
| --- | |
| ## 10. Audience-mode wording cheatsheet | |
| | Element | Research mode | Policy mode | | |
| |---|---|---| | |
| | Reproducibility gap badge | "Reproducibility gap" | "Setup not documented" | | |
| | Reproducibility tooltip | "Setup not fully documented. Missing: {fields}." | "This score's setup isn't documented, so it can't be re-run as-is." | | |
| | Reproducibility panel title | "Reproducibility" | "Re-runnability" | | |
| | Completeness chip label | "Documentation" | "Documentation" | | |
| | Completeness panel title | "Reporting completeness" | "How well is this benchmark documented?" | | |
| | Provenance: first-party | "1st party" | "Reported by model developer" | | |
| | Provenance: first-party only | "1st party only (no replication)" | "Only the model developer reported this score" | | |
| | Provenance: third-party | "3rd party" | "Independently reported" | | |
| | Provenance: collaborative | "Collaborative" | "Joint report" | | |
| | Variant divergence badge | "Variant divergence" | "Score depends on setup" | | |
| | Variant divergence tooltip | "Scores diverge by {magnitude} across different setups: {fields}." | "Different runs of this evaluation produced different scores; the setup matters." | | |
| | Cross-party divergence badge | "Cross-party divergence" | "Sources disagree" | | |
| | Cross-party divergence tooltip | "Reports diverge by {magnitude} across organizations." | "Different organizations reported different scores for this same model on this same benchmark." | | |
| Adjust tone but keep the underlying numbers identical across modes: the data is the same, only the framing changes. | |
| --- | |
| *Last updated 2026-04-27. Maintainer: backend pipeline (eval_cards_backend_pipeline), frontend (general-eval-card). Questions on backend semantics → [eval_cards_backend_pipeline#2](https://github.com/evaleval/eval_cards_backend_pipeline/issues/2). Questions on UX → discuss with @anka-evals + frontend team.* | |