# EvalCards interpretive signals — frontend implementation spec **Status:** ready to implement. Backend ships in `evaleval/eval_cards_backend_pipeline` PR #1 (merged `b05323c`). All field shapes below are stable and covered by the backend's test suite. **Companion docs:** - Spec source of truth: *EvalCards Interpretive Signals v1.0* (Anka Reuel, Stanford). Section refs (§3, §4, …) below point at that doc. - Open backend questions: [evaleval/eval_cards_backend_pipeline#2](https://github.com/evaleval/eval_cards_backend_pipeline/issues/2). None block frontend work — they may shift wording, not shape. --- ## 0. What this PR does at a glance The backend now annotates evaluation records with four interpretive signals: 1. **Reproducibility gap** — *per row.* Was the evaluation documented well enough to be re-run? Surfaced as a missing-fields list (e.g. "missing `max_tokens`"). 2. **Reporting completeness** — *per benchmark.* What fraction of EvalCards-required documentation fields are populated? Surfaced as a `[0, 1]` score with a missing-field breakdown. 3. **Provenance** — *per row.* Who reported this score (first-party / third-party / collaborative / unspecified), and is it the only source for this `(model, benchmark, metric)` group? 4. **Comparability** — *per `(model, benchmark, metric)` group.* Two flavors: **variant divergence** (same model, same benchmark, different setups → diverging scores) and **cross-party divergence** (different orgs reporting → diverging scores). Plus a corpus-level rollup file (`corpus-aggregates.json`) for a stratified analytics page. The frontend's job: surface these signals **in three places** — row-level badges, per-eval / per-model summary panels, and a corpus dashboard view. --- ## 1. Where the new data lives All fields are new additions to existing artifacts. No artifact is removed or reshaped. | Artifact | New fields | |---|---| | `evals/{id}.json` (`HFEvalDetail`) | Per-row `evalcards.annotations` block on every `metrics[].model_results[]` and `subtasks[…].metrics[].model_results[]`. Plus eval-root `evalcards.annotations.reporting_completeness`, `evalcards.annotations.benchmark_comparability`, and three top-level summaries: `reproducibility_summary`, `provenance_summary`, `comparability_summary`. | | `models/{id}.json` (`HFModelDetail`) | Per-row `evalcards.annotations` block on every `hierarchy_by_category[*][*].metrics[].model_results[]`. Plus three top-level summaries scoped to that model. | | `eval-list.json` / `eval-list-lite.json` (`HFEvalListEntry`) | Three summaries per entry. | | `model-cards.json` / `model-cards-lite.json` (`HFModelCardEntry`) | Three summaries per entry. | | `eval-hierarchy.json` (`EvalHierarchy`) | Each family node and leaf node carries the three summaries (aggregated over evals under it). | | **`corpus-aggregates.json` (NEW FILE)** | Stratified rollups for paper / dashboard use. | | `manifest.json` | New entry in `summary_artifacts`: `corpus_aggregates: "corpus-aggregates.json"`. | `signal_version` (currently `"1.0"`) is present on every annotation. Treat it as opaque; surface only in admin/debug. --- ## 2. TypeScript types to add Add to `lib/backend-artifacts.ts` (preferred — these are pipeline contract types): ```ts // Spec §3 export interface ReproducibilityGap { has_reproducibility_gap: boolean missing_fields: string[] // e.g. ["max_tokens"] required_field_count: number // 2 base + 2 if agentic on current runtime populated_field_count: number signal_version: string } // Spec §5 export type ProvenanceSourceType = | "first_party" | "third_party" | "collaborative" | "unspecified" export interface Provenance { source_type: ProvenanceSourceType is_multi_source: boolean first_party_only: boolean // see §6.1 below for caveat distinct_reporting_organizations: number signal_version: string } // Spec §6.1 export interface VariantDivergence { has_variant_divergence: boolean group_id: string // "{model_route_id}__{metric_summary_id}" divergence_magnitude: number threshold_used: number threshold_basis: | "proportion_or_continuous_normalized" | "percent" | "range_5pct" | "fallback_default" differing_setup_fields: Array<{ field: string; values: unknown[] }> scores_in_group: number[] this_triple_score: number | null // this row's score within the group triple_count_in_group: number score_scale_anomaly: boolean group_variant_breakdown: Array<{ variant_key: string; row_count: number }> signal_version: string } // Spec §6.2 export interface CrossPartyDivergence { has_cross_party_divergence: boolean group_id: string divergence_magnitude: number threshold_used: number threshold_basis: VariantDivergence["threshold_basis"] scores_by_organization: Record // display org name → score differing_setup_fields: Array<{ field: string; values: unknown[] }> organization_count: number group_variant_breakdown: Array<{ variant_key: string; row_count: number }> signal_version: string } // Per-row annotation block (carried on every model_result row) export interface RowAnnotations { reproducibility_gap: ReproducibilityGap | null provenance: Provenance | null variant_divergence: VariantDivergence | null cross_party_divergence: CrossPartyDivergence | null } // Spec §4 export interface ReportingCompleteness { completeness_score: number // [0, 1] total_fields_evaluated: number missing_required_fields: string[] // dotted paths partial_fields: Array<{ field_path: string score: number // (0, 1) — strictly between populated_subitems: number total_subitems: number }> field_scores: Array<{ field_path: string coverage_type: "full" | "partial" | "reserved" score: number // [0, 1] }> signal_version: string } export interface BenchmarkComparability { variant_divergence_groups: Array<{ group_id: string model_route_id: string divergence_magnitude: number threshold_used: number threshold_basis: VariantDivergence["threshold_basis"] differing_setup_fields: VariantDivergence["differing_setup_fields"] }> cross_party_divergence_groups: Array<{ group_id: string model_route_id: string divergence_magnitude: number threshold_used: number threshold_basis: VariantDivergence["threshold_basis"] scores_by_organization: Record differing_setup_fields: VariantDivergence["differing_setup_fields"] }> } // Eval-root or model-root annotation block export interface EvalcardsAnnotations { reporting_completeness?: ReportingCompleteness benchmark_comparability?: BenchmarkComparability } // Top-level summary blocks (present on eval-list / model-cards / eval / model / hierarchy nodes) export interface ReproducibilitySummary { results_total: number has_reproducibility_gap_count: number populated_ratio_avg: number | null // null when results_total == 0 } export interface ProvenanceSummary { total_results: number total_groups: number multi_source_groups: number first_party_only_groups: number source_type_distribution: Record } export interface ComparabilitySummary { total_groups: number groups_with_variant_check: number // eligible groups (>=2 rows, differing setups, >=2 scored) groups_with_cross_party_check: number // eligible groups (>=2 named orgs) variant_divergent_count: number cross_party_divergent_count: number } export interface SignalSummaries { reproducibility_summary?: ReproducibilitySummary provenance_summary?: ProvenanceSummary comparability_summary?: ComparabilitySummary } // corpus-aggregates.json export interface CorpusAggregates { generated_at: string signal_version: string stratification_dimensions: ["category"] reproducibility: Stratified completeness: Stratified provenance: Stratified comparability: Stratified } export interface Stratified { overall: T by_category: Record // categories: agentic | general | knowledge | reasoning | safety | other } export interface ReproducibilityCorpusBlock { total_triples: number triples_with_reproducibility_gap: number reproducibility_gap_rate: number | null agentic_triples: number per_field_missingness: Record } export interface CompletenessCorpusBlock { total_benchmarks: number completeness_score_mean: number | null completeness_score_median: number | null per_field_population: Record } export interface ProvenanceCorpusBlock { total_triples: number total_groups: number multi_source_groups: number multi_source_rate: number | null first_party_only_groups: number first_party_only_rate: number | null source_type_distribution: Record } export interface ComparabilityCorpusBlock { total_groups: number variant_eligible_groups: number variant_divergent_groups: number variant_divergence_rate: number | null cross_party_eligible_groups: number cross_party_divergent_groups: number cross_party_divergence_rate: number | null // commonly null on current corpus } ``` Then in `lib/hf-data.ts`: - Extend `HFEvalModelResult` (line ~522) with `evalcards?: { annotations?: RowAnnotations }`. - Extend `HFEvalDetail` (line ~556) with `evalcards?: { annotations?: EvalcardsAnnotations }` plus the three summary fields from `SignalSummaries`. - Extend `HFEvalListEntry` (line ~475) with `SignalSummaries` fields. - Extend `HFModelCardEntry` (line ~439) with `SignalSummaries` fields. - Extend `HFModelDetail` (line ~571) with `SignalSummaries` fields. - Extend `HFModelHierarchyMetric` (line ~616) — `model_results` already typed as `HFEvalModelResult`, so the per-row annotations propagate automatically. In `EvalHierarchy` types (`lib/backend-artifacts.ts` line ~54), add `SignalSummaries` to both `HierarchyFamily` and `HierarchyBenchmark`. All fields are **optional** at the type level — older cached snapshots won't have them, and the frontend should render gracefully when they're absent. --- ## 3. Data plumbing ### 3.1 New fetcher + API route for corpus aggregates In `lib/hf-data.ts`, add after the existing fetchers (~line 866): ```ts export async function fetchCorpusAggregates(): Promise { return fetchHFJsonSafe("corpus-aggregates.json") } ``` Add to `scripts/cache-hf-data.mjs` `CACHE_ROOT_FILES` array: `"corpus-aggregates.json"`. (Mark it optional in `OPTIONAL_CACHE_ROOT_FILES` if shipping while the HF dataset upload is still rolling — once the backend pipeline next runs against the dataset, the file will appear.) Create `app/api/corpus-aggregates/route.ts`: ```ts import { NextResponse } from "next/server" import { fetchCorpusAggregates } from "@/lib/hf-data" export async function GET() { const aggregates = await fetchCorpusAggregates() if (!aggregates) { return NextResponse.json({ error: "Corpus aggregates not available" }, { status: 404 }) } return NextResponse.json(aggregates) } ``` ### 3.2 Rest of plumbing is automatic Existing fetchers (`fetchEvalDetail`, `fetchModelDetail`, `fetchEvalList`, `fetchModelCardsList`, `fetchEvalHierarchy`) just pull the raw JSON, so the new fields propagate without code changes once the types above are widened. --- ## 4. UX components to build Build a small set of reusable signal components in `components/signals/`. Each takes one of the typed shapes above and renders a badge / panel. This keeps signal rendering consistent across `eval-detail.tsx`, `benchmark-detail.tsx`, `model-compare-dialog.tsx`, and the new corpus dashboard. ``` components/signals/ ├── reproducibility-badge.tsx ├── provenance-badge.tsx // already partially exists in benchmark-detail.tsx — see §4.2 ├── variant-divergence-badge.tsx ├── cross-party-divergence-badge.tsx ├── reproducibility-panel.tsx // detail view — full missing-fields list ├── completeness-panel.tsx // detail view — score bar + missing-field list ├── comparability-panel.tsx // detail view — divergent groups list ├── signals-row-badges.tsx // composite: renders all four row-level badges with proper spacing └── signal-tooltip.tsx // shared tooltip primitive ``` All badges should follow the existing tone conventions used by `getRelationshipBadgeTone` ([components/benchmark-detail.tsx:289](../components/benchmark-detail.tsx#L289)) and the `Badge` primitive in [components/ui/badge.tsx](../components/ui/badge.tsx). ### 4.1 Row-level badges — placement Insert `` next to the score cell in: - **Eval detail leaderboard table** — [components/eval-detail.tsx:869-871](../components/eval-detail.tsx#L869-L871) (the `` containing the score). Render badges below the score on a new line for desktop, hidden on mobile. - **Benchmark detail rows** — `components/benchmark-detail.tsx` renders score rows in several places (search for `formatRawScoreValue`); insert the same component. - **Model compare dialog** — [components/model-compare-dialog.tsx](../components/model-compare-dialog.tsx) score columns. **Display rules — only badge for actionable states.** Silence is meaningful here. | Signal | Show badge when | Hide when | |---|---|---| | Reproducibility | `has_reproducibility_gap === true` | gap=false, or annotation absent | | Provenance | `source_type` ∈ {`first_party`, `third_party`, `collaborative`} | `source_type === "unspecified"` | | Variant divergence | `variant_divergence !== null && has_variant_divergence === true` | null (not applicable) or false (checked, fine) | | Cross-party divergence | `cross_party_divergence !== null && has_cross_party_divergence === true` | null (almost always on current corpus) or false | `has_*: false` means "we checked and it's fine" — silent success. `null` means "not applicable / not enough data" — also silent. **Only divergent / gap-positive states warrant pixels.** **Dedup rule.** `variant_divergence` and `cross_party_divergence` are duplicated onto every row in the same group. If you render three rows from the same `group_id`, render the divergence badge on each row but the *expanded panel* (§4.4) only once at the group header. ### 4.2 Provenance badge — reuse what's there [components/benchmark-detail.tsx:262-302](../components/benchmark-detail.tsx#L262-L302) already has `getRelationshipShortLabel` and `getRelationshipBadgeTone`. Extract these into `components/signals/provenance-badge.tsx` and import back into `benchmark-detail.tsx`. The new badge should **also** consume the new `Provenance` annotation when present (it carries `is_multi_source` and `first_party_only`, which the current implementation derives row-by-row from `source_metadata` alone). When `provenance.first_party_only === true`, show a small ⚠ subtle indicator on the first-party badge ("first-party only — no independent replication"). This is the headline use of the signal for policy-mode readers. ### 4.3 Reproducibility badge — content rules Tooltip content depends on audience mode (`useAudienceMode()` from [components/audience-mode-provider.tsx:40](../components/audience-mode-provider.tsx#L40)): - Research mode: "Setup not fully documented. Missing: `max_tokens`, `eval_plan`." - Policy mode: "This score's setup isn't fully documented, so it can't be re-run as-is." Always include the count "{populated_field_count} of {required_field_count} setup fields recorded." Don't hardcode "4 fields" — the active runtime checks 2 base fields (`temperature`, `max_tokens`) plus 2 agentic fields (`eval_plan`, `eval_limits`) when the benchmark is agentic. Read counts off the annotation. ### 4.4 Detail panels — placement #### Reproducibility panel The existing "Evaluation Provenance" panel in [components/eval-detail.tsx:952-998](../components/eval-detail.tsx#L952-L998) (rendered when a row is expanded) is the right place for the **per-row** reproducibility breakdown. Add a new `DetailPanel` adjacent to it: ```tsx {rowAnnotations?.reproducibility_gap && ( {rowAnnotations.reproducibility_gap.missing_fields.length > 0 && ( )} )} ``` #### Completeness panel Render at the **eval-detail header level** (above the leaderboard, below the metric specification card). New ``. UI: progress bar showing `completeness_score`, label "{N} of {M} fields populated" where N = sum of `field_scores[].score` rounded, M = `total_fields_evaluated`. Below: collapsible accordions: - **Missing required fields** (count badge) — list of `missing_required_fields` with friendly labels (see §6.4 for label mapping). - **Partially populated** (count badge) — `partial_fields` rendered as "{field}: {populated_subitems}/{total_subitems}". In policy mode, don't show the dotted-path field names — show friendly labels only. In research mode, show both. #### Comparability panel Also at eval-detail header level. Sourced from `detail.evalcards?.annotations?.benchmark_comparability`. Render as two collapsibles — "Variant divergence ({count})" and "Cross-party divergence ({count})". Each item should link to the relevant model row (use `model_route_id` from each group entry as anchor — add `id={"row-" + model_route_id}` on the leaderboard row). When both arrays are empty, hide the panel entirely. When `comparability_summary.groups_with_cross_party_check === 0` (the common state), surface a small note: "No third-party reports available for cross-party comparison." ### 4.5 Per-eval header chips On the eval-detail page header (next to existing "Measures" / "Source dataset" chips around [components/eval-detail.tsx:486-525](../components/eval-detail.tsx#L486-L525)), add a fourth chip when `evalcards.annotations.reporting_completeness` is present: > **Documentation** > {round(completeness_score * 100)}% Tooltip: "{N} of {M} EvalCards documentation fields populated for this benchmark." ### 4.6 Per-model card chips On `components/eval-card.tsx` and the model card pages, add three chips driven by the model-level summaries. Replace the hand-written hint at [components/eval-card.tsx:250](../components/eval-card.tsx#L250) ("Some results lack generation settings; compare scores with care.") with a data-driven version: > {has_reproducibility_gap_count} of {results_total} reported scores aren't fully documented. Show only when `has_reproducibility_gap_count > 0`. The hand-written hint was a placeholder for exactly this signal — wire it up. --- ## 5. New page: corpus dashboard Add `app/corpus/page.tsx` (linked from main navigation [components/navigation.tsx](../components/navigation.tsx)). Server component that calls `fetchCorpusAggregates()` and renders four sections: ### 5.1 Reproducibility section - Headline number: `reproducibility_gap_rate` rendered as percentage. Sub-label: "{triples_with_reproducibility_gap} of {total_triples} reported scores." - Per-field horizontal bar chart from `per_field_missingness`. **Bar denominator depends on `denominator` field**: agentic-only fields use `agentic_triples`, others use `total_triples`. Label each bar with the denominator type so users understand. - Toggle: `overall` ↔ `by_category` (rendered as a small-multiple grid, one panel per category). ### 5.2 Completeness section - Headline: `completeness_score_mean` (and median) across `total_benchmarks`. - Histogram of per-benchmark scores (pull individual benchmark scores from `eval-list.json` `reporting_completeness.completeness_score`, since corpus-aggregates only carries mean/median). - Per-field bar chart from `per_field_population` — three bars per field: `mean_score`, `populated_rate`, `fully_populated_rate`. (See §6.7 for which one to highlight per coverage type.) ### 5.3 Provenance section - Stacked bar of `source_type_distribution` (across all triples). - Two ratios: `multi_source_rate`, `first_party_only_rate`. Label both: "% of (model, benchmark, metric) groups." ### 5.4 Comparability section - Two side-by-side panels: Variant divergence (eligible-aware rate) and Cross-party divergence (often null). - **When `cross_party_divergence_rate === null`:** show a "Not enough multi-org coverage to compute" empty state, not "0%". Same for `variant_divergence_rate === null`. This is critical — see §6.5. All sections support a category toggle (research mode shows category breakdowns by default; policy mode shows overall by default). --- ## 6. Caveats and edge cases (read these before implementing) ### 6.1 `first_party_only` semantics A row can be `first_party_only: true` even when `is_multi_source: false`. The spec literal: a group with one *named* org reporting first-party gets the badge. **Don't read it as "exclusive coverage"** — read it as "no independent replication." The label suggestion is "First-party only" rather than "Sole source." If `distinct_reporting_organizations === 0` (all rows have null org), `first_party_only` is `false` even when `source_type === "first_party"`. Render the row's source as "First-party (org unspecified)" in research mode; suppress the first-party-only badge. ### 6.2 Active reproducibility field set is reduced The spec describes four base fields (`temperature`, `top_p`, `max_tokens`, `prompt_template`); the active backend currently checks **only `temperature` and `max_tokens`** plus `eval_plan` / `eval_limits` for agentic benchmarks. **Don't hardcode "4 fields" anywhere.** Always read `required_field_count` off the annotation. This is a deliberate spec-author choice and may revert; the field count is the only stable interface. ### 6.3 Missing-field path strings `missing_fields` for reproducibility uses bare names (e.g. `"max_tokens"`). `missing_required_fields` for completeness uses dotted paths (e.g. `"autobenchmarkcard.methodology.baseline_results"`). Different conventions, intentional. Build a small label map for completeness paths — paths come from [registry/completeness_fields.json](https://github.com/evaleval/eval_cards_backend_pipeline/blob/main/registry/completeness_fields.json) on the backend repo. Suggested label rules: - Drop the `autobenchmarkcard.` / `eee_eval.` / `evalcards.` prefix. - Replace dots with " / ", underscore with space, title-case. - Example: `autobenchmarkcard.methodology.baseline_results` → "Methodology / Baseline results". ### 6.4 `differing_setup_fields[].values` may contain null and mixed types Per spec §6.1.4, `null` is a *distinct* value from any explicit setting (comparing "explicit 2048" to "unspecified" is meaningful). Render `null` as "(unspecified)" rather than the string "null". Numeric, string, boolean, and object values can all appear in the same array; render with `JSON.stringify` for objects, plain text otherwise. ### 6.5 `null` rates in comparability are *not* zero Eligibility-aware denominators mean `variant_divergence_rate` and `cross_party_divergence_rate` are `null` when no groups were eligible. **Render as "N/A — not enough data" or an empty-state card, never as "0%".** On the current corpus, `cross_party_divergence_rate` will commonly be null (third-party reports are sparse). Treat this as a normal state, not a data-loading error. ### 6.6 Score-scale anomaly flag `variant_divergence.score_scale_anomaly === true` indicates the metric was declared `proportion` but scores fell outside [0, 1] — usually a metric-normalization bug upstream. Surface as a small "data quality warning" annotation alongside the divergence number; the divergence is still computed but the threshold may not be apples-to-apples. ### 6.7 `mean_score` vs `populated_rate` for completeness Per-field aggregates expose three numbers. Pick which to display based on `coverage_type`: - **`full` and `reserved` fields** — `mean_score` and `populated_rate` are equal. Show one number labeled "% of benchmarks populating this field." - **`partial` fields** — they diverge. `populated_rate` = % of benchmarks with *any* sub-item; `mean_score` = average sub-item population fraction. Show both: "{populated_rate}% have any data, {mean_score}% on average across sub-items." ### 6.8 No `computed_at` on per-record annotations Only `signal_version` is on each annotation. For "last computed" UI text, use `manifest.json → generated_at` from the existing `BackendManifest`. ### 6.9 Stratification categories `by_category` keys are: `agentic`, `general`, `knowledge`, `reasoning`, `safety`, `other`. Same set as the existing `category` field on evals — reuse whatever color scheme is currently keyed off `inferCategoryFromBenchmark` ([lib/benchmark-schema.ts](../lib/benchmark-schema.ts)). ### 6.10 Annotation block can be `null` or absent `evalcards.annotations.{reproducibility_gap,provenance,variant_divergence,cross_party_divergence}` can each be `null` independently, and the entire `evalcards` block may be absent on older cached snapshots. Use optional chaining everywhere; never assume presence. The `RowAnnotations` type intentionally types each subfield as `T | null` (not `T | undefined`) because the backend writes explicit `null`. --- ## 7. Suggested implementation order 1. **Types + plumbing** (1–2 hours): types in `backend-artifacts.ts` + `hf-data.ts`, the `fetchCorpusAggregates` fetcher, the API route, and adding `corpus-aggregates.json` to the cache script. No UI yet. 2. **Row-level badges** (½ day): build `signals/` directory with the four badge components, the dedup-aware `signals-row-badges.tsx`, and wire into eval-detail and benchmark-detail. This is the most visible win. 3. **Per-eval completeness panel + comparability panel** (½ day): single benchmark, easy to design around. New `CompletenessPanel` is the headline new UX in this set. 4. **Per-row reproducibility detail panel** (1–2 hours): drops into the existing expanded row layout. 5. **Per-eval / per-model header chips + replace the hand-written gap hint** (1–2 hours): wires the summary fields into existing card surfaces. 6. **Corpus dashboard page** (1–2 days): new route, new components, biggest scope. Defer until 1–5 are live and reviewed. Each step is independently shippable. Steps 1–5 can land before the corpus dashboard is designed. --- ## 8. Out of scope (don't do these yet) - **Filter / sort the eval list by signal state** ("show only benchmarks with completeness > 0.5"). Wait for the dashboard view to land first; users will tell us which filters they actually want. - **Side-by-side score comparison with divergence overlay.** The data supports it (`scores_in_group`, `scores_by_organization`) but the design space is large. Hold off until we see the row-level badges in use. - **Recompute / verification UI for missing reproducibility fields.** Backend-side; out of scope here. - **Per-instance sample-level badges.** Signals operate at row / benchmark level; sample-level instance data is unaffected. --- ## 9. Reference: minimal real-shape examples Per-row `evalcards.annotations` with all four signals populated: ```jsonc { "reproducibility_gap": { "has_reproducibility_gap": true, "missing_fields": ["max_tokens"], "required_field_count": 2, "populated_field_count": 1, "signal_version": "1.0" }, "provenance": { "source_type": "first_party", "is_multi_source": false, "first_party_only": true, "distinct_reporting_organizations": 1, "signal_version": "1.0" }, "variant_divergence": null, "cross_party_divergence": null } ``` Per-eval `evalcards.annotations` with completeness + comparability: ```jsonc { "reporting_completeness": { "completeness_score": 0.62, "total_fields_evaluated": 28, "missing_required_fields": [ "autobenchmarkcard.methodology.baseline_results", "autobenchmarkcard.methodology.validation", "evalcards.preregistration_url" ], "partial_fields": [ { "field_path": "autobenchmarkcard.data", "score": 0.5, "populated_subitems": 2, "total_subitems": 4 } ], "field_scores": [/* 28 entries */], "signal_version": "1.0" }, "benchmark_comparability": { "variant_divergence_groups": [ { "group_id": "openai__gpt-5__hfopenllm_v2_bbh_accuracy", "model_route_id": "openai__gpt-5", "divergence_magnitude": 0.12, "threshold_used": 0.05, "threshold_basis": "proportion_or_continuous_normalized", "differing_setup_fields": [ { "field": "max_tokens", "values": [2048, 4096, 8192] } ] } ], "cross_party_divergence_groups": [] } } ``` Top-level `provenance_summary` example: ```jsonc { "total_results": 142, "total_groups": 47, "multi_source_groups": 3, "first_party_only_groups": 30, "source_type_distribution": { "first_party": 120, "third_party": 18, "collaborative": 0, "unspecified": 4 } } ``` `corpus-aggregates.json` structure (top of file): ```jsonc { "generated_at": "2026-04-27T...", "signal_version": "1.0", "stratification_dimensions": ["category"], "reproducibility": { "overall": {/* ReproducibilityCorpusBlock */}, "by_category": { "agentic": {...}, "general": {...}, ... } }, "completeness": { "overall": {/* CompletenessCorpusBlock */}, "by_category": {...} }, "provenance": { "overall": {/* ProvenanceCorpusBlock */}, "by_category": {...} }, "comparability": { "overall": {/* ComparabilityCorpusBlock */}, "by_category": {...} } } ``` --- ## 10. Audience-mode wording cheatsheet | Element | Research mode | Policy mode | |---|---|---| | Reproducibility gap badge | "Reproducibility gap" | "Setup not documented" | | Reproducibility tooltip | "Setup not fully documented. Missing: {fields}." | "This score's setup isn't documented, so it can't be re-run as-is." | | Reproducibility panel title | "Reproducibility" | "Re-runnability" | | Completeness chip label | "Documentation" | "Documentation" | | Completeness panel title | "Reporting completeness" | "How well is this benchmark documented?" | | Provenance: first-party | "1st party" | "Reported by model developer" | | Provenance: first-party only | "1st party only — no replication" | "Only the model developer reported this score" | | Provenance: third-party | "3rd party" | "Independently reported" | | Provenance: collaborative | "Collaborative" | "Joint report" | | Variant divergence badge | "Variant divergence" | "Score depends on setup" | | Variant divergence tooltip | "Scores diverge by {magnitude} across different setups: {fields}." | "Different runs of this evaluation produced different scores — the setup matters." | | Cross-party divergence badge | "Cross-party divergence" | "Sources disagree" | | Cross-party divergence tooltip | "Reports diverge by {magnitude} across organizations." | "Different organizations reported different scores for this same model on this same benchmark." | Adjust tone but keep the underlying numbers identical across modes — the data is the same, only the framing changes. --- *Last updated 2026-04-27. Maintainer: backend pipeline (eval_cards_backend_pipeline), frontend (general-eval-card). Questions on backend semantics → [eval_cards_backend_pipeline#2](https://github.com/evaleval/eval_cards_backend_pipeline/issues/2). Questions on UX → discuss with @anka-evals + frontend team.*