Spaces:
Running
EvalCards interpretive signals β frontend implementation spec
Status: ready to implement. Backend ships in evaleval/eval_cards_backend_pipeline PR #1 (merged b05323c). All field shapes below are stable and covered by the backend's test suite.
Companion docs:
- Spec source of truth: EvalCards Interpretive Signals v1.0 (Anka Reuel, Stanford). Section refs (Β§3, Β§4, β¦) below point at that doc.
- Open backend questions: evaleval/eval_cards_backend_pipeline#2. None block frontend work β they may shift wording, not shape.
0. What this PR does at a glance
The backend now annotates evaluation records with four interpretive signals:
- Reproducibility gap β per row. Was the evaluation documented well enough to be re-run? Surfaced as a missing-fields list (e.g. "missing
max_tokens"). - Reporting completeness β per benchmark. What fraction of EvalCards-required documentation fields are populated? Surfaced as a
[0, 1]score with a missing-field breakdown. - Provenance β per row. Who reported this score (first-party / third-party / collaborative / unspecified), and is it the only source for this
(model, benchmark, metric)group? - Comparability β per
(model, benchmark, metric)group. Two flavors: variant divergence (same model, same benchmark, different setups β diverging scores) and cross-party divergence (different orgs reporting β diverging scores).
Plus a corpus-level rollup file (corpus-aggregates.json) for a stratified analytics page.
The frontend's job: surface these signals in three places β row-level badges, per-eval / per-model summary panels, and a corpus dashboard view.
1. Where the new data lives
All fields are new additions to existing artifacts. No artifact is removed or reshaped.
| Artifact | New fields |
|---|---|
evals/{id}.json (HFEvalDetail) |
Per-row evalcards.annotations block on every metrics[].model_results[] and subtasks[β¦].metrics[].model_results[]. Plus eval-root evalcards.annotations.reporting_completeness, evalcards.annotations.benchmark_comparability, and three top-level summaries: reproducibility_summary, provenance_summary, comparability_summary. |
models/{id}.json (HFModelDetail) |
Per-row evalcards.annotations block on every hierarchy_by_category[*][*].metrics[].model_results[]. Plus three top-level summaries scoped to that model. |
eval-list.json / eval-list-lite.json (HFEvalListEntry) |
Three summaries per entry. |
model-cards.json / model-cards-lite.json (HFModelCardEntry) |
Three summaries per entry. |
eval-hierarchy.json (EvalHierarchy) |
Each family node and leaf node carries the three summaries (aggregated over evals under it). |
corpus-aggregates.json (NEW FILE) |
Stratified rollups for paper / dashboard use. |
manifest.json |
New entry in summary_artifacts: corpus_aggregates: "corpus-aggregates.json". |
signal_version (currently "1.0") is present on every annotation. Treat it as opaque; surface only in admin/debug.
2. TypeScript types to add
Add to lib/backend-artifacts.ts (preferred β these are pipeline contract types):
// Spec Β§3
export interface ReproducibilityGap {
has_reproducibility_gap: boolean
missing_fields: string[] // e.g. ["max_tokens"]
required_field_count: number // 2 base + 2 if agentic on current runtime
populated_field_count: number
signal_version: string
}
// Spec Β§5
export type ProvenanceSourceType =
| "first_party"
| "third_party"
| "collaborative"
| "unspecified"
export interface Provenance {
source_type: ProvenanceSourceType
is_multi_source: boolean
first_party_only: boolean // see Β§6.1 below for caveat
distinct_reporting_organizations: number
signal_version: string
}
// Spec Β§6.1
export interface VariantDivergence {
has_variant_divergence: boolean
group_id: string // "{model_route_id}__{metric_summary_id}"
divergence_magnitude: number
threshold_used: number
threshold_basis:
| "proportion_or_continuous_normalized"
| "percent"
| "range_5pct"
| "fallback_default"
differing_setup_fields: Array<{ field: string; values: unknown[] }>
scores_in_group: number[]
this_triple_score: number | null // this row's score within the group
triple_count_in_group: number
score_scale_anomaly: boolean
group_variant_breakdown: Array<{ variant_key: string; row_count: number }>
signal_version: string
}
// Spec Β§6.2
export interface CrossPartyDivergence {
has_cross_party_divergence: boolean
group_id: string
divergence_magnitude: number
threshold_used: number
threshold_basis: VariantDivergence["threshold_basis"]
scores_by_organization: Record<string, number> // display org name β score
differing_setup_fields: Array<{ field: string; values: unknown[] }>
organization_count: number
group_variant_breakdown: Array<{ variant_key: string; row_count: number }>
signal_version: string
}
// Per-row annotation block (carried on every model_result row)
export interface RowAnnotations {
reproducibility_gap: ReproducibilityGap | null
provenance: Provenance | null
variant_divergence: VariantDivergence | null
cross_party_divergence: CrossPartyDivergence | null
}
// Spec Β§4
export interface ReportingCompleteness {
completeness_score: number // [0, 1]
total_fields_evaluated: number
missing_required_fields: string[] // dotted paths
partial_fields: Array<{
field_path: string
score: number // (0, 1) β strictly between
populated_subitems: number
total_subitems: number
}>
field_scores: Array<{
field_path: string
coverage_type: "full" | "partial" | "reserved"
score: number // [0, 1]
}>
signal_version: string
}
export interface BenchmarkComparability {
variant_divergence_groups: Array<{
group_id: string
model_route_id: string
divergence_magnitude: number
threshold_used: number
threshold_basis: VariantDivergence["threshold_basis"]
differing_setup_fields: VariantDivergence["differing_setup_fields"]
}>
cross_party_divergence_groups: Array<{
group_id: string
model_route_id: string
divergence_magnitude: number
threshold_used: number
threshold_basis: VariantDivergence["threshold_basis"]
scores_by_organization: Record<string, number>
differing_setup_fields: VariantDivergence["differing_setup_fields"]
}>
}
// Eval-root or model-root annotation block
export interface EvalcardsAnnotations {
reporting_completeness?: ReportingCompleteness
benchmark_comparability?: BenchmarkComparability
}
// Top-level summary blocks (present on eval-list / model-cards / eval / model / hierarchy nodes)
export interface ReproducibilitySummary {
results_total: number
has_reproducibility_gap_count: number
populated_ratio_avg: number | null // null when results_total == 0
}
export interface ProvenanceSummary {
total_results: number
total_groups: number
multi_source_groups: number
first_party_only_groups: number
source_type_distribution: Record<ProvenanceSourceType, number>
}
export interface ComparabilitySummary {
total_groups: number
groups_with_variant_check: number // eligible groups (>=2 rows, differing setups, >=2 scored)
groups_with_cross_party_check: number // eligible groups (>=2 named orgs)
variant_divergent_count: number
cross_party_divergent_count: number
}
export interface SignalSummaries {
reproducibility_summary?: ReproducibilitySummary
provenance_summary?: ProvenanceSummary
comparability_summary?: ComparabilitySummary
}
// corpus-aggregates.json
export interface CorpusAggregates {
generated_at: string
signal_version: string
stratification_dimensions: ["category"]
reproducibility: Stratified<ReproducibilityCorpusBlock>
completeness: Stratified<CompletenessCorpusBlock>
provenance: Stratified<ProvenanceCorpusBlock>
comparability: Stratified<ComparabilityCorpusBlock>
}
export interface Stratified<T> {
overall: T
by_category: Record<string, T> // categories: agentic | general | knowledge | reasoning | safety | other
}
export interface ReproducibilityCorpusBlock {
total_triples: number
triples_with_reproducibility_gap: number
reproducibility_gap_rate: number | null
agentic_triples: number
per_field_missingness: Record<string, {
missing_count: number
missing_rate: number | null
denominator: "all_triples" | "agentic_only"
denominator_count: number
}>
}
export interface CompletenessCorpusBlock {
total_benchmarks: number
completeness_score_mean: number | null
completeness_score_median: number | null
per_field_population: Record<string, {
mean_score: number
populated_rate: number
fully_populated_rate: number
benchmark_count: number
}>
}
export interface ProvenanceCorpusBlock {
total_triples: number
total_groups: number
multi_source_groups: number
multi_source_rate: number | null
first_party_only_groups: number
first_party_only_rate: number | null
source_type_distribution: Record<ProvenanceSourceType, number>
}
export interface ComparabilityCorpusBlock {
total_groups: number
variant_eligible_groups: number
variant_divergent_groups: number
variant_divergence_rate: number | null
cross_party_eligible_groups: number
cross_party_divergent_groups: number
cross_party_divergence_rate: number | null // commonly null on current corpus
}
Then in lib/hf-data.ts:
- Extend
HFEvalModelResult(line ~522) withevalcards?: { annotations?: RowAnnotations }. - Extend
HFEvalDetail(line ~556) withevalcards?: { annotations?: EvalcardsAnnotations }plus the three summary fields fromSignalSummaries. - Extend
HFEvalListEntry(line ~475) withSignalSummariesfields. - Extend
HFModelCardEntry(line ~439) withSignalSummariesfields. - Extend
HFModelDetail(line ~571) withSignalSummariesfields. - Extend
HFModelHierarchyMetric(line ~616) βmodel_resultsalready typed asHFEvalModelResult, so the per-row annotations propagate automatically.
In EvalHierarchy types (lib/backend-artifacts.ts line ~54), add SignalSummaries to both HierarchyFamily and HierarchyBenchmark.
All fields are optional at the type level β older cached snapshots won't have them, and the frontend should render gracefully when they're absent.
3. Data plumbing
3.1 New fetcher + API route for corpus aggregates
In lib/hf-data.ts, add after the existing fetchers (~line 866):
export async function fetchCorpusAggregates(): Promise<CorpusAggregates | null> {
return fetchHFJsonSafe<CorpusAggregates>("corpus-aggregates.json")
}
Add to scripts/cache-hf-data.mjs CACHE_ROOT_FILES array: "corpus-aggregates.json". (Mark it optional in OPTIONAL_CACHE_ROOT_FILES if shipping while the HF dataset upload is still rolling β once the backend pipeline next runs against the dataset, the file will appear.)
Create app/api/corpus-aggregates/route.ts:
import { NextResponse } from "next/server"
import { fetchCorpusAggregates } from "@/lib/hf-data"
export async function GET() {
const aggregates = await fetchCorpusAggregates()
if (!aggregates) {
return NextResponse.json({ error: "Corpus aggregates not available" }, { status: 404 })
}
return NextResponse.json(aggregates)
}
3.2 Rest of plumbing is automatic
Existing fetchers (fetchEvalDetail, fetchModelDetail, fetchEvalList, fetchModelCardsList, fetchEvalHierarchy) just pull the raw JSON, so the new fields propagate without code changes once the types above are widened.
4. UX components to build
Build a small set of reusable signal components in components/signals/. Each takes one of the typed shapes above and renders a badge / panel. This keeps signal rendering consistent across eval-detail.tsx, benchmark-detail.tsx, model-compare-dialog.tsx, and the new corpus dashboard.
components/signals/
βββ reproducibility-badge.tsx
βββ provenance-badge.tsx // already partially exists in benchmark-detail.tsx β see Β§4.2
βββ variant-divergence-badge.tsx
βββ cross-party-divergence-badge.tsx
βββ reproducibility-panel.tsx // detail view β full missing-fields list
βββ completeness-panel.tsx // detail view β score bar + missing-field list
βββ comparability-panel.tsx // detail view β divergent groups list
βββ signals-row-badges.tsx // composite: renders all four row-level badges with proper spacing
βββ signal-tooltip.tsx // shared tooltip primitive
All badges should follow the existing tone conventions used by getRelationshipBadgeTone (components/benchmark-detail.tsx:289) and the Badge primitive in components/ui/badge.tsx.
4.1 Row-level badges β placement
Insert <SignalsRowBadges annotations={modelResult.evalcards?.annotations} /> next to the score cell in:
- Eval detail leaderboard table β components/eval-detail.tsx:869-871 (the
<TableCell className="text-right">containing the score). Render badges below the score on a new line for desktop, hidden on mobile. - Benchmark detail rows β
components/benchmark-detail.tsxrenders score rows in several places (search forformatRawScoreValue); insert the same component. - Model compare dialog β components/model-compare-dialog.tsx score columns.
Display rules β only badge for actionable states. Silence is meaningful here.
| Signal | Show badge when | Hide when |
|---|---|---|
| Reproducibility | has_reproducibility_gap === true |
gap=false, or annotation absent |
| Provenance | source_type β {first_party, third_party, collaborative} |
source_type === "unspecified" |
| Variant divergence | variant_divergence !== null && has_variant_divergence === true |
null (not applicable) or false (checked, fine) |
| Cross-party divergence | cross_party_divergence !== null && has_cross_party_divergence === true |
null (almost always on current corpus) or false |
has_*: false means "we checked and it's fine" β silent success. null means "not applicable / not enough data" β also silent. Only divergent / gap-positive states warrant pixels.
Dedup rule. variant_divergence and cross_party_divergence are duplicated onto every row in the same group. If you render three rows from the same group_id, render the divergence badge on each row but the expanded panel (Β§4.4) only once at the group header.
4.2 Provenance badge β reuse what's there
components/benchmark-detail.tsx:262-302 already has getRelationshipShortLabel and getRelationshipBadgeTone. Extract these into components/signals/provenance-badge.tsx and import back into benchmark-detail.tsx. The new badge should also consume the new Provenance annotation when present (it carries is_multi_source and first_party_only, which the current implementation derives row-by-row from source_metadata alone).
When provenance.first_party_only === true, show a small β subtle indicator on the first-party badge ("first-party only β no independent replication"). This is the headline use of the signal for policy-mode readers.
4.3 Reproducibility badge β content rules
Tooltip content depends on audience mode (useAudienceMode() from components/audience-mode-provider.tsx:40):
- Research mode: "Setup not fully documented. Missing:
max_tokens,eval_plan." - Policy mode: "This score's setup isn't fully documented, so it can't be re-run as-is."
Always include the count "{populated_field_count} of {required_field_count} setup fields recorded." Don't hardcode "4 fields" β the active runtime checks 2 base fields (temperature, max_tokens) plus 2 agentic fields (eval_plan, eval_limits) when the benchmark is agentic. Read counts off the annotation.
4.4 Detail panels β placement
Reproducibility panel
The existing "Evaluation Provenance" panel in components/eval-detail.tsx:952-998 (rendered when a row is expanded) is the right place for the per-row reproducibility breakdown. Add a new DetailPanel adjacent to it:
{rowAnnotations?.reproducibility_gap && (
<DetailPanel
title={isResearchView ? "Reproducibility" : "Re-runnability"}
subtitle={
isResearchView
? "Whether the setup is documented well enough for someone else to re-run."
: "Whether someone could re-run this evaluation with the information available."
}
>
<MetaRow
label="Setup fields recorded"
value={`${rowAnnotations.reproducibility_gap.populated_field_count} of ${rowAnnotations.reproducibility_gap.required_field_count}`}
/>
{rowAnnotations.reproducibility_gap.missing_fields.length > 0 && (
<MetaRow
label="Missing"
value={rowAnnotations.reproducibility_gap.missing_fields.join(", ")}
/>
)}
</DetailPanel>
)}
Completeness panel
Render at the eval-detail header level (above the leaderboard, below the metric specification card). New <CompletenessPanel completeness={detail.evalcards?.annotations?.reporting_completeness} />. UI: progress bar showing completeness_score, label "{N} of {M} fields populated" where N = sum of field_scores[].score rounded, M = total_fields_evaluated. Below: collapsible accordions:
- Missing required fields (count badge) β list of
missing_required_fieldswith friendly labels (see Β§6.4 for label mapping). - Partially populated (count badge) β
partial_fieldsrendered as "{field}: {populated_subitems}/{total_subitems}".
In policy mode, don't show the dotted-path field names β show friendly labels only. In research mode, show both.
Comparability panel
Also at eval-detail header level. Sourced from detail.evalcards?.annotations?.benchmark_comparability. Render as two collapsibles β "Variant divergence ({count})" and "Cross-party divergence ({count})". Each item should link to the relevant model row (use model_route_id from each group entry as anchor β add id={"row-" + model_route_id} on the leaderboard row).
When both arrays are empty, hide the panel entirely. When comparability_summary.groups_with_cross_party_check === 0 (the common state), surface a small note: "No third-party reports available for cross-party comparison."
4.5 Per-eval header chips
On the eval-detail page header (next to existing "Measures" / "Source dataset" chips around components/eval-detail.tsx:486-525), add a fourth chip when evalcards.annotations.reporting_completeness is present:
Documentation {round(completeness_score * 100)}%
Tooltip: "{N} of {M} EvalCards documentation fields populated for this benchmark."
4.6 Per-model card chips
On components/eval-card.tsx and the model card pages, add three chips driven by the model-level summaries. Replace the hand-written hint at components/eval-card.tsx:250 ("Some results lack generation settings; compare scores with care.") with a data-driven version:
{has_reproducibility_gap_count} of {results_total} reported scores aren't fully documented.
Show only when has_reproducibility_gap_count > 0. The hand-written hint was a placeholder for exactly this signal β wire it up.
5. New page: corpus dashboard
Add app/corpus/page.tsx (linked from main navigation components/navigation.tsx). Server component that calls fetchCorpusAggregates() and renders four sections:
5.1 Reproducibility section
- Headline number:
reproducibility_gap_raterendered as percentage. Sub-label: "{triples_with_reproducibility_gap} of {total_triples} reported scores." - Per-field horizontal bar chart from
per_field_missingness. Bar denominator depends ondenominatorfield: agentic-only fields useagentic_triples, others usetotal_triples. Label each bar with the denominator type so users understand. - Toggle:
overallβby_category(rendered as a small-multiple grid, one panel per category).
5.2 Completeness section
- Headline:
completeness_score_mean(and median) acrosstotal_benchmarks. - Histogram of per-benchmark scores (pull individual benchmark scores from
eval-list.jsonreporting_completeness.completeness_score, since corpus-aggregates only carries mean/median). - Per-field bar chart from
per_field_populationβ three bars per field:mean_score,populated_rate,fully_populated_rate. (See Β§6.7 for which one to highlight per coverage type.)
5.3 Provenance section
- Stacked bar of
source_type_distribution(across all triples). - Two ratios:
multi_source_rate,first_party_only_rate. Label both: "% of (model, benchmark, metric) groups."
5.4 Comparability section
- Two side-by-side panels: Variant divergence (eligible-aware rate) and Cross-party divergence (often null).
- When
cross_party_divergence_rate === null: show a "Not enough multi-org coverage to compute" empty state, not "0%". Same forvariant_divergence_rate === null. This is critical β see Β§6.5.
All sections support a category toggle (research mode shows category breakdowns by default; policy mode shows overall by default).
6. Caveats and edge cases (read these before implementing)
6.1 first_party_only semantics
A row can be first_party_only: true even when is_multi_source: false. The spec literal: a group with one named org reporting first-party gets the badge. Don't read it as "exclusive coverage" β read it as "no independent replication." The label suggestion is "First-party only" rather than "Sole source."
If distinct_reporting_organizations === 0 (all rows have null org), first_party_only is false even when source_type === "first_party". Render the row's source as "First-party (org unspecified)" in research mode; suppress the first-party-only badge.
6.2 Active reproducibility field set is reduced
The spec describes four base fields (temperature, top_p, max_tokens, prompt_template); the active backend currently checks only temperature and max_tokens plus eval_plan / eval_limits for agentic benchmarks. Don't hardcode "4 fields" anywhere. Always read required_field_count off the annotation. This is a deliberate spec-author choice and may revert; the field count is the only stable interface.
6.3 Missing-field path strings
missing_fields for reproducibility uses bare names (e.g. "max_tokens"). missing_required_fields for completeness uses dotted paths (e.g. "autobenchmarkcard.methodology.baseline_results"). Different conventions, intentional. Build a small label map for completeness paths β paths come from registry/completeness_fields.json on the backend repo. Suggested label rules:
- Drop the
autobenchmarkcard./eee_eval./evalcards.prefix. - Replace dots with " / ", underscore with space, title-case.
- Example:
autobenchmarkcard.methodology.baseline_resultsβ "Methodology / Baseline results".
6.4 differing_setup_fields[].values may contain null and mixed types
Per spec Β§6.1.4, null is a distinct value from any explicit setting (comparing "explicit 2048" to "unspecified" is meaningful). Render null as "(unspecified)" rather than the string "null". Numeric, string, boolean, and object values can all appear in the same array; render with JSON.stringify for objects, plain text otherwise.
6.5 null rates in comparability are not zero
Eligibility-aware denominators mean variant_divergence_rate and cross_party_divergence_rate are null when no groups were eligible. Render as "N/A β not enough data" or an empty-state card, never as "0%". On the current corpus, cross_party_divergence_rate will commonly be null (third-party reports are sparse). Treat this as a normal state, not a data-loading error.
6.6 Score-scale anomaly flag
variant_divergence.score_scale_anomaly === true indicates the metric was declared proportion but scores fell outside [0, 1] β usually a metric-normalization bug upstream. Surface as a small "data quality warning" annotation alongside the divergence number; the divergence is still computed but the threshold may not be apples-to-apples.
6.7 mean_score vs populated_rate for completeness
Per-field aggregates expose three numbers. Pick which to display based on coverage_type:
fullandreservedfields βmean_scoreandpopulated_rateare equal. Show one number labeled "% of benchmarks populating this field."partialfields β they diverge.populated_rate= % of benchmarks with any sub-item;mean_score= average sub-item population fraction. Show both: "{populated_rate}% have any data, {mean_score}% on average across sub-items."
6.8 No computed_at on per-record annotations
Only signal_version is on each annotation. For "last computed" UI text, use manifest.json β generated_at from the existing BackendManifest.
6.9 Stratification categories
by_category keys are: agentic, general, knowledge, reasoning, safety, other. Same set as the existing category field on evals β reuse whatever color scheme is currently keyed off inferCategoryFromBenchmark (lib/benchmark-schema.ts).
6.10 Annotation block can be null or absent
evalcards.annotations.{reproducibility_gap,provenance,variant_divergence,cross_party_divergence} can each be null independently, and the entire evalcards block may be absent on older cached snapshots. Use optional chaining everywhere; never assume presence. The RowAnnotations type intentionally types each subfield as T | null (not T | undefined) because the backend writes explicit null.
7. Suggested implementation order
- Types + plumbing (1β2 hours): types in
backend-artifacts.ts+hf-data.ts, thefetchCorpusAggregatesfetcher, the API route, and addingcorpus-aggregates.jsonto the cache script. No UI yet. - Row-level badges (Β½ day): build
signals/directory with the four badge components, the dedup-awaresignals-row-badges.tsx, and wire into eval-detail and benchmark-detail. This is the most visible win. - Per-eval completeness panel + comparability panel (Β½ day): single benchmark, easy to design around. New
CompletenessPanelis the headline new UX in this set. - Per-row reproducibility detail panel (1β2 hours): drops into the existing expanded row layout.
- Per-eval / per-model header chips + replace the hand-written gap hint (1β2 hours): wires the summary fields into existing card surfaces.
- Corpus dashboard page (1β2 days): new route, new components, biggest scope. Defer until 1β5 are live and reviewed.
Each step is independently shippable. Steps 1β5 can land before the corpus dashboard is designed.
8. Out of scope (don't do these yet)
- Filter / sort the eval list by signal state ("show only benchmarks with completeness > 0.5"). Wait for the dashboard view to land first; users will tell us which filters they actually want.
- Side-by-side score comparison with divergence overlay. The data supports it (
scores_in_group,scores_by_organization) but the design space is large. Hold off until we see the row-level badges in use. - Recompute / verification UI for missing reproducibility fields. Backend-side; out of scope here.
- Per-instance sample-level badges. Signals operate at row / benchmark level; sample-level instance data is unaffected.
9. Reference: minimal real-shape examples
Per-row evalcards.annotations with all four signals populated:
{
"reproducibility_gap": {
"has_reproducibility_gap": true,
"missing_fields": ["max_tokens"],
"required_field_count": 2,
"populated_field_count": 1,
"signal_version": "1.0"
},
"provenance": {
"source_type": "first_party",
"is_multi_source": false,
"first_party_only": true,
"distinct_reporting_organizations": 1,
"signal_version": "1.0"
},
"variant_divergence": null,
"cross_party_divergence": null
}
Per-eval evalcards.annotations with completeness + comparability:
{
"reporting_completeness": {
"completeness_score": 0.62,
"total_fields_evaluated": 28,
"missing_required_fields": [
"autobenchmarkcard.methodology.baseline_results",
"autobenchmarkcard.methodology.validation",
"evalcards.preregistration_url"
],
"partial_fields": [
{ "field_path": "autobenchmarkcard.data", "score": 0.5, "populated_subitems": 2, "total_subitems": 4 }
],
"field_scores": [/* 28 entries */],
"signal_version": "1.0"
},
"benchmark_comparability": {
"variant_divergence_groups": [
{
"group_id": "openai__gpt-5__hfopenllm_v2_bbh_accuracy",
"model_route_id": "openai__gpt-5",
"divergence_magnitude": 0.12,
"threshold_used": 0.05,
"threshold_basis": "proportion_or_continuous_normalized",
"differing_setup_fields": [
{ "field": "max_tokens", "values": [2048, 4096, 8192] }
]
}
],
"cross_party_divergence_groups": []
}
}
Top-level provenance_summary example:
{
"total_results": 142,
"total_groups": 47,
"multi_source_groups": 3,
"first_party_only_groups": 30,
"source_type_distribution": {
"first_party": 120,
"third_party": 18,
"collaborative": 0,
"unspecified": 4
}
}
corpus-aggregates.json structure (top of file):
{
"generated_at": "2026-04-27T...",
"signal_version": "1.0",
"stratification_dimensions": ["category"],
"reproducibility": { "overall": {/* ReproducibilityCorpusBlock */}, "by_category": { "agentic": {...}, "general": {...}, ... } },
"completeness": { "overall": {/* CompletenessCorpusBlock */}, "by_category": {...} },
"provenance": { "overall": {/* ProvenanceCorpusBlock */}, "by_category": {...} },
"comparability": { "overall": {/* ComparabilityCorpusBlock */}, "by_category": {...} }
}
10. Audience-mode wording cheatsheet
| Element | Research mode | Policy mode |
|---|---|---|
| Reproducibility gap badge | "Reproducibility gap" | "Setup not documented" |
| Reproducibility tooltip | "Setup not fully documented. Missing: {fields}." | "This score's setup isn't documented, so it can't be re-run as-is." |
| Reproducibility panel title | "Reproducibility" | "Re-runnability" |
| Completeness chip label | "Documentation" | "Documentation" |
| Completeness panel title | "Reporting completeness" | "How well is this benchmark documented?" |
| Provenance: first-party | "1st party" | "Reported by model developer" |
| Provenance: first-party only | "1st party only β no replication" | "Only the model developer reported this score" |
| Provenance: third-party | "3rd party" | "Independently reported" |
| Provenance: collaborative | "Collaborative" | "Joint report" |
| Variant divergence badge | "Variant divergence" | "Score depends on setup" |
| Variant divergence tooltip | "Scores diverge by {magnitude} across different setups: {fields}." | "Different runs of this evaluation produced different scores β the setup matters." |
| Cross-party divergence badge | "Cross-party divergence" | "Sources disagree" |
| Cross-party divergence tooltip | "Reports diverge by {magnitude} across organizations." | "Different organizations reported different scores for this same model on this same benchmark." |
Adjust tone but keep the underlying numbers identical across modes β the data is the same, only the framing changes.
Last updated 2026-04-27. Maintainer: backend pipeline (eval_cards_backend_pipeline), frontend (general-eval-card). Questions on backend semantics β eval_cards_backend_pipeline#2. Questions on UX β discuss with @anka-evals + frontend team.