# EvalCards interpretive signals: frontend implementation spec
**Status:** ready to implement. Backend ships in `evaleval/eval_cards_backend_pipeline` PR #1 (merged `b05323c`). All field shapes below are stable and covered by the backend's test suite.
**Companion docs:**
- Spec source of truth: *EvalCards Interpretive Signals v1.0* (Anka Reuel, Stanford). References written "Spec §N" / "spec §N" point at that doc; a bare §N refers to a section of this document.
- Open backend questions: [evaleval/eval_cards_backend_pipeline#2](https://github.com/evaleval/eval_cards_backend_pipeline/issues/2). None block frontend work; they may shift wording, not shape.
---
## 0. What this PR does at a glance
The backend now annotates evaluation records with four interpretive signals:
1. **Reproducibility gap** (*per row*). Was the evaluation documented well enough to be re-run? Surfaced as a missing-fields list (e.g. "missing `max_tokens`").
2. **Reporting completeness** (*per benchmark*). What fraction of EvalCards-required documentation fields are populated? Surfaced as a `[0, 1]` score with a missing-field breakdown.
3. **Provenance** (*per row*). Who reported this score (first-party / third-party / collaborative / unspecified), and is it the only source for this `(model, benchmark, metric)` group?
4. **Comparability** (*per `(model, benchmark, metric)` group*). Two flavors: **variant divergence** (same model, same benchmark, different setups → diverging scores) and **cross-party divergence** (different orgs reporting → diverging scores).
Plus a corpus-level rollup file (`corpus-aggregates.json`) for a stratified analytics page.
The frontend's job: surface these signals **in three places**: row-level badges, per-eval / per-model summary panels, and a corpus dashboard view.
---
## 1. Where the new data lives
All fields are new additions to existing artifacts. No artifact is removed or reshaped.
| Artifact | New fields |
|---|---|
| `evals/{id}.json` (`HFEvalDetail`) | Per-row `evalcards.annotations` block on every `metrics[].model_results[]` and `subtasks[…].metrics[].model_results[]`. Plus eval-root `evalcards.annotations.reporting_completeness`, `evalcards.annotations.benchmark_comparability`, and three top-level summaries: `reproducibility_summary`, `provenance_summary`, `comparability_summary`. |
| `models/{id}.json` (`HFModelDetail`) | Per-row `evalcards.annotations` block on every `hierarchy_by_category[*][*].metrics[].model_results[]`. Plus three top-level summaries scoped to that model. |
| `eval-list.json` / `eval-list-lite.json` (`HFEvalListEntry`) | Three summaries per entry. |
| `model-cards.json` / `model-cards-lite.json` (`HFModelCardEntry`) | Three summaries per entry. |
| `eval-hierarchy.json` (`EvalHierarchy`) | Each family node and leaf node carries the three summaries (aggregated over evals under it). |
| **`corpus-aggregates.json` (NEW FILE)** | Stratified rollups for paper / dashboard use. |
| `manifest.json` | New entry in `summary_artifacts`: `corpus_aggregates: "corpus-aggregates.json"`. |
`signal_version` (currently `"1.0"`) is present on every annotation. Treat it as opaque; surface only in admin/debug.
---
## 2. TypeScript types to add
Add to `lib/backend-artifacts.ts` (preferred, since these are pipeline contract types):
```ts
// Spec §3
export interface ReproducibilityGap {
  has_reproducibility_gap: boolean
  missing_fields: string[] // e.g. ["max_tokens"]
  required_field_count: number // 2 base + 2 if agentic on current runtime
  populated_field_count: number
  signal_version: string
}

// Spec §5
export type ProvenanceSourceType =
  | "first_party"
  | "third_party"
  | "collaborative"
  | "unspecified"

export interface Provenance {
  source_type: ProvenanceSourceType
  is_multi_source: boolean
  first_party_only: boolean // see §6.1 below for caveat
  distinct_reporting_organizations: number
  signal_version: string
}

// Spec §6.1
export interface VariantDivergence {
  has_variant_divergence: boolean
  group_id: string // "{model_route_id}__{metric_summary_id}"
  divergence_magnitude: number
  threshold_used: number
  threshold_basis:
    | "proportion_or_continuous_normalized"
    | "percent"
    | "range_5pct"
    | "fallback_default"
  differing_setup_fields: Array<{ field: string; values: unknown[] }>
  scores_in_group: number[]
  this_triple_score: number | null // this row's score within the group
  triple_count_in_group: number
  score_scale_anomaly: boolean
  group_variant_breakdown: Array<{ variant_key: string; row_count: number }>
  signal_version: string
}

// Spec §6.2
export interface CrossPartyDivergence {
  has_cross_party_divergence: boolean
  group_id: string
  divergence_magnitude: number
  threshold_used: number
  threshold_basis: VariantDivergence["threshold_basis"]
  scores_by_organization: Record<string, number> // display org name → score
  differing_setup_fields: Array<{ field: string; values: unknown[] }>
  organization_count: number
  group_variant_breakdown: Array<{ variant_key: string; row_count: number }>
  signal_version: string
}

// Per-row annotation block (carried on every model_result row)
export interface RowAnnotations {
  reproducibility_gap: ReproducibilityGap | null
  provenance: Provenance | null
  variant_divergence: VariantDivergence | null
  cross_party_divergence: CrossPartyDivergence | null
}

// Spec §4
export interface ReportingCompleteness {
  completeness_score: number // [0, 1]
  total_fields_evaluated: number
  missing_required_fields: string[] // dotted paths
  partial_fields: Array<{
    field_path: string
    score: number // (0, 1), strictly between
    populated_subitems: number
    total_subitems: number
  }>
  field_scores: Array<{
    field_path: string
    coverage_type: "full" | "partial" | "reserved"
    score: number // [0, 1]
  }>
  signal_version: string
}

export interface BenchmarkComparability {
  variant_divergence_groups: Array<{
    group_id: string
    model_route_id: string
    divergence_magnitude: number
    threshold_used: number
    threshold_basis: VariantDivergence["threshold_basis"]
    differing_setup_fields: VariantDivergence["differing_setup_fields"]
  }>
  cross_party_divergence_groups: Array<{
    group_id: string
    model_route_id: string
    divergence_magnitude: number
    threshold_used: number
    threshold_basis: VariantDivergence["threshold_basis"]
    scores_by_organization: Record<string, number>
    differing_setup_fields: VariantDivergence["differing_setup_fields"]
  }>
}

// Eval-root or model-root annotation block
export interface EvalcardsAnnotations {
  reporting_completeness?: ReportingCompleteness
  benchmark_comparability?: BenchmarkComparability
}

// Top-level summary blocks (present on eval-list / model-cards / eval / model / hierarchy nodes)
export interface ReproducibilitySummary {
  results_total: number
  has_reproducibility_gap_count: number
  populated_ratio_avg: number | null // null when results_total == 0
}

export interface ProvenanceSummary {
  total_results: number
  total_groups: number
  multi_source_groups: number
  first_party_only_groups: number
  source_type_distribution: Record<ProvenanceSourceType, number>
}

export interface ComparabilitySummary {
  total_groups: number
  groups_with_variant_check: number // eligible groups (>=2 rows, differing setups, >=2 scored)
  groups_with_cross_party_check: number // eligible groups (>=2 named orgs)
  variant_divergent_count: number
  cross_party_divergent_count: number
}

export interface SignalSummaries {
  reproducibility_summary?: ReproducibilitySummary
  provenance_summary?: ProvenanceSummary
  comparability_summary?: ComparabilitySummary
}

// corpus-aggregates.json
export interface CorpusAggregates {
  generated_at: string
  signal_version: string
  stratification_dimensions: ["category"]
  reproducibility: Stratified<ReproducibilityCorpusBlock>
  completeness: Stratified<CompletenessCorpusBlock>
  provenance: Stratified<ProvenanceCorpusBlock>
  comparability: Stratified<ComparabilityCorpusBlock>
}

export interface Stratified<T> {
  overall: T
  by_category: Record<string, T> // categories: agentic | general | knowledge | reasoning | safety | other
}

export interface ReproducibilityCorpusBlock {
  total_triples: number
  triples_with_reproducibility_gap: number
  reproducibility_gap_rate: number | null
  agentic_triples: number
  per_field_missingness: Record<string, {
    missing_count: number
    missing_rate: number | null
    denominator: "all_triples" | "agentic_only"
    denominator_count: number
  }>
}

export interface CompletenessCorpusBlock {
  total_benchmarks: number
  completeness_score_mean: number | null
  completeness_score_median: number | null
  per_field_population: Record<string, {
    mean_score: number
    populated_rate: number
    fully_populated_rate: number
    benchmark_count: number
  }>
}

export interface ProvenanceCorpusBlock {
  total_triples: number
  total_groups: number
  multi_source_groups: number
  multi_source_rate: number | null
  first_party_only_groups: number
  first_party_only_rate: number | null
  source_type_distribution: Record<ProvenanceSourceType, number>
}

export interface ComparabilityCorpusBlock {
  total_groups: number
  variant_eligible_groups: number
  variant_divergent_groups: number
  variant_divergence_rate: number | null
  cross_party_eligible_groups: number
  cross_party_divergent_groups: number
  cross_party_divergence_rate: number | null // commonly null on current corpus
}
```
Then in `lib/hf-data.ts`:
- Extend `HFEvalModelResult` (line ~522) with `evalcards?: { annotations?: RowAnnotations }`.
- Extend `HFEvalDetail` (line ~556) with `evalcards?: { annotations?: EvalcardsAnnotations }` plus the three summary fields from `SignalSummaries`.
- Extend `HFEvalListEntry` (line ~475) with `SignalSummaries` fields.
- Extend `HFModelCardEntry` (line ~439) with `SignalSummaries` fields.
- Extend `HFModelDetail` (line ~571) with `SignalSummaries` fields.
- `HFModelHierarchyMetric` (line ~616) needs no change: `model_results` is already typed as `HFEvalModelResult`, so the per-row annotations propagate automatically.
In `EvalHierarchy` types (`lib/backend-artifacts.ts` line ~54), add `SignalSummaries` to both `HierarchyFamily` and `HierarchyBenchmark`.
All fields are **optional** at the type level: older cached snapshots won't have them, and the frontend should render gracefully when they're absent.
---
## 3. Data plumbing
### 3.1 New fetcher + API route for corpus aggregates
In `lib/hf-data.ts`, add after the existing fetchers (~line 866):
```ts
export async function fetchCorpusAggregates(): Promise<CorpusAggregates | null> {
  return fetchHFJsonSafe<CorpusAggregates>("corpus-aggregates.json")
}
```
Add `"corpus-aggregates.json"` to the `CACHE_ROOT_FILES` array in `scripts/cache-hf-data.mjs`. (Mark it optional in `OPTIONAL_CACHE_ROOT_FILES` if shipping while the HF dataset upload is still rolling out; the file will appear once the backend pipeline next runs against the dataset.)
Create `app/api/corpus-aggregates/route.ts`:
```ts
import { NextResponse } from "next/server"
import { fetchCorpusAggregates } from "@/lib/hf-data"

export async function GET() {
  const aggregates = await fetchCorpusAggregates()
  if (!aggregates) {
    return NextResponse.json({ error: "Corpus aggregates not available" }, { status: 404 })
  }
  return NextResponse.json(aggregates)
}
```
### 3.2 Rest of plumbing is automatic
Existing fetchers (`fetchEvalDetail`, `fetchModelDetail`, `fetchEvalList`, `fetchModelCardsList`, `fetchEvalHierarchy`) just pull the raw JSON, so the new fields propagate without code changes once the types above are widened.
---
## 4. UX components to build
Build a small set of reusable signal components in `components/signals/`. Each takes one of the typed shapes above and renders a badge / panel. This keeps signal rendering consistent across `eval-detail.tsx`, `benchmark-detail.tsx`, `model-compare-dialog.tsx`, and the new corpus dashboard.
```
components/signals/
├── reproducibility-badge.tsx
├── provenance-badge.tsx              // already partially exists in benchmark-detail.tsx (see §4.2)
├── variant-divergence-badge.tsx
├── cross-party-divergence-badge.tsx
├── reproducibility-panel.tsx         // detail view: full missing-fields list
├── completeness-panel.tsx            // detail view: score bar + missing-field list
├── comparability-panel.tsx           // detail view: divergent groups list
├── signals-row-badges.tsx            // composite: renders all four row-level badges with proper spacing
└── signal-tooltip.tsx                // shared tooltip primitive
```
All badges should follow the existing tone conventions used by `getRelationshipBadgeTone` ([components/benchmark-detail.tsx:289](../components/benchmark-detail.tsx#L289)) and the `Badge` primitive in [components/ui/badge.tsx](../components/ui/badge.tsx).
### 4.1 Row-level badges: placement
Insert `<SignalsRowBadges annotations={modelResult.evalcards?.annotations} />` next to the score cell in:
- **Eval detail leaderboard table**: [components/eval-detail.tsx:869-871](../components/eval-detail.tsx#L869-L871) (the `<TableCell className="text-right">` containing the score). Render badges below the score on a new line for desktop, hidden on mobile.
- **Benchmark detail rows**: `components/benchmark-detail.tsx` renders score rows in several places (search for `formatRawScoreValue`); insert the same component.
- **Model compare dialog**: [components/model-compare-dialog.tsx](../components/model-compare-dialog.tsx) score columns.
**Display rules: badge only actionable states.** Silence is meaningful here.
| Signal | Show badge when | Hide when |
|---|---|---|
| Reproducibility | `has_reproducibility_gap === true` | gap=false, or annotation absent |
| Provenance | `source_type` ∈ {`first_party`, `third_party`, `collaborative`} | `source_type === "unspecified"` |
| Variant divergence | `variant_divergence !== null && has_variant_divergence === true` | null (not applicable) or false (checked, fine) |
| Cross-party divergence | `cross_party_divergence !== null && has_cross_party_divergence === true` | null (almost always on current corpus) or false |
`has_*: false` means "we checked and it's fine" (silent success). `null` means "not applicable / not enough data" (also silent). **Only divergent / gap-positive states warrant pixels.**
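
The display rules above can be collapsed into one pure function that the composite badge component calls before rendering anything. This is a minimal sketch, not the required implementation: `visibleBadges` and the `*Sketch` types are hypothetical names, and the types are trimmed local copies of the §2 shapes so the block stands alone.

```typescript
// Trimmed local copies of the §2 annotation shapes (only the fields used here).
type SourceTypeSketch = "first_party" | "third_party" | "collaborative" | "unspecified"

interface RowAnnotationsSketch {
  reproducibility_gap: { has_reproducibility_gap: boolean } | null
  provenance: { source_type: SourceTypeSketch } | null
  variant_divergence: { has_variant_divergence: boolean } | null
  cross_party_divergence: { has_cross_party_divergence: boolean } | null
}

interface VisibleBadgesSketch {
  reproducibility: boolean
  provenance: boolean
  variantDivergence: boolean
  crossPartyDivergence: boolean
}

// Hypothetical helper: maps an optional annotation block to badge visibility.
// Absent blocks, null signals, and has_*: false all resolve to "no badge".
function visibleBadges(a?: RowAnnotationsSketch | null): VisibleBadgesSketch {
  const sourceType = a?.provenance?.source_type
  return {
    reproducibility: a?.reproducibility_gap?.has_reproducibility_gap === true,
    provenance: sourceType !== undefined && sourceType !== "unspecified",
    variantDivergence: a?.variant_divergence?.has_variant_divergence === true,
    crossPartyDivergence: a?.cross_party_divergence?.has_cross_party_divergence === true,
  }
}
```

Comparing against `=== true` rather than relying on truthiness keeps "absent", `null`, and `false` on the same silent path without separate branches.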
**Dedup rule.** `variant_divergence` and `cross_party_divergence` are duplicated onto every row in the same group. If you render three rows from the same `group_id`, render the divergence badge on each row but the *expanded panel* (§4.4) only once at the group header.
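
One way to apply the dedup rule is to decide up front which row "owns" the expanded panel for each group. The sketch below assumes the first rendered row of a group stands in for the group header; `panelOwnerFlags` and `GroupedRowSketch` are hypothetical names, and a real table with explicit group headers would not need this.

```typescript
interface GroupedRowSketch {
  model_route_id: string
  group_id: string | null // null when the row carries no divergence annotation
}

// Hypothetical helper: for each row, true if the expanded divergence panel
// should render there (first row seen for its group_id), false otherwise.
function panelOwnerFlags(rows: GroupedRowSketch[]): boolean[] {
  const seen = new Set<string>()
  return rows.map((row) => {
    if (row.group_id === null || seen.has(row.group_id)) return false
    seen.add(row.group_id)
    return true
  })
}
```

Badges are unaffected by this flag: every row in a divergent group still shows its badge, and only the panel is rendered once.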
### 4.2 Provenance badge: reuse what's there
[components/benchmark-detail.tsx:262-302](../components/benchmark-detail.tsx#L262-L302) already has `getRelationshipShortLabel` and `getRelationshipBadgeTone`. Extract these into `components/signals/provenance-badge.tsx` and import back into `benchmark-detail.tsx`. The new badge should **also** consume the new `Provenance` annotation when present (it carries `is_multi_source` and `first_party_only`, which the current implementation derives row-by-row from `source_metadata` alone).
When `provenance.first_party_only === true`, show a small, subtle indicator on the first-party badge ("first-party only: no independent replication"). This is the headline use of the signal for policy-mode readers.
### 4.3 Reproducibility badge: content rules
Tooltip content depends on audience mode (`useAudienceMode()` from [components/audience-mode-provider.tsx:40](../components/audience-mode-provider.tsx#L40)):
- Research mode: "Setup not fully documented. Missing: `max_tokens`, `eval_plan`."
- Policy mode: "This score's setup isn't fully documented, so it can't be re-run as-is."
Always include the count "{populated_field_count} of {required_field_count} setup fields recorded." Don't hardcode "4 fields": the active runtime checks 2 base fields (`temperature`, `max_tokens`) plus 2 agentic fields (`eval_plan`, `eval_limits`) when the benchmark is agentic. Read counts off the annotation.
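
A sketch of the tooltip text under these rules. `reproducibilityTooltip` is a hypothetical helper, and the `"research" | "policy"` union is an assumption about what `useAudienceMode()` exposes; adapt it to the real return type.

```typescript
// Assumed audience modes; the real useAudienceMode() shape may differ.
type AudienceModeSketch = "research" | "policy"

interface ReproGapSketch {
  missing_fields: string[]
  required_field_count: number
  populated_field_count: number
}

// Hypothetical helper: mode-specific lead sentence plus the count sentence,
// with both counts read off the annotation rather than hardcoded.
function reproducibilityTooltip(gap: ReproGapSketch, mode: AudienceModeSketch): string {
  const lead =
    mode === "research"
      ? `Setup not fully documented. Missing: ${gap.missing_fields.join(", ")}.`
      : "This score's setup isn't fully documented, so it can't be re-run as-is."
  const count = `${gap.populated_field_count} of ${gap.required_field_count} setup fields recorded.`
  return `${lead} ${count}`
}
```

For the §9 example row (`missing_fields: ["max_tokens"]`, 1 of 2 populated), research mode yields "Setup not fully documented. Missing: max_tokens. 1 of 2 setup fields recorded."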
### 4.4 Detail panels: placement
#### Reproducibility panel
The existing "Evaluation Provenance" panel in [components/eval-detail.tsx:952-998](../components/eval-detail.tsx#L952-L998) (rendered when a row is expanded) is the right place for the **per-row** reproducibility breakdown. Add a new `DetailPanel` adjacent to it:
```tsx
{rowAnnotations?.reproducibility_gap && (
  <DetailPanel
    title={isResearchView ? "Reproducibility" : "Re-runnability"}
    subtitle={
      isResearchView
        ? "Whether the setup is documented well enough for someone else to re-run."
        : "Whether someone could re-run this evaluation with the information available."
    }
  >
    <MetaRow
      label="Setup fields recorded"
      value={`${rowAnnotations.reproducibility_gap.populated_field_count} of ${rowAnnotations.reproducibility_gap.required_field_count}`}
    />
    {rowAnnotations.reproducibility_gap.missing_fields.length > 0 && (
      <MetaRow
        label="Missing"
        value={rowAnnotations.reproducibility_gap.missing_fields.join(", ")}
      />
    )}
  </DetailPanel>
)}
```
#### Completeness panel
Render at the **eval-detail header level** (above the leaderboard, below the metric specification card). New `<CompletenessPanel completeness={detail.evalcards?.annotations?.reporting_completeness} />`. UI: progress bar showing `completeness_score`, label "{N} of {M} fields populated" where N = sum of `field_scores[].score` rounded, M = `total_fields_evaluated`. Below: collapsible accordions:
- **Missing required fields** (count badge): list of `missing_required_fields` with friendly labels (see §6.3 for label mapping).
- **Partially populated** (count badge): `partial_fields` rendered as "{field}: {populated_subitems}/{total_subitems}".
In policy mode, don't show the dotted-path field names; show friendly labels only. In research mode, show both.
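
The "{N} of {M}" label and the percentage shown on the progress bar can be derived as below. This is a sketch with hypothetical helper names (`completenessLabel`, `completenessPercent`) and a trimmed local copy of `ReportingCompleteness`; it follows the stated rule that N is the rounded sum of `field_scores[].score`.

```typescript
interface CompletenessSketch {
  completeness_score: number // [0, 1]
  total_fields_evaluated: number
  field_scores: Array<{ field_path: string; score: number }>
}

// N = rounded sum of per-field scores, M = total_fields_evaluated.
function completenessLabel(c: CompletenessSketch): string {
  const populated = Math.round(c.field_scores.reduce((sum, f) => sum + f.score, 0))
  return `${populated} of ${c.total_fields_evaluated} fields populated`
}

// Whole-number percentage for the progress bar or a header chip.
function completenessPercent(c: CompletenessSketch): number {
  return Math.round(c.completeness_score * 100)
}
```

Because partial fields contribute fractional scores, N is an approximation of "fields populated", which is why the panel also lists partially populated fields separately.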
#### Comparability panel
Also at eval-detail header level. Sourced from `detail.evalcards?.annotations?.benchmark_comparability`. Render as two collapsibles: "Variant divergence ({count})" and "Cross-party divergence ({count})". Each item should link to the relevant model row (use `model_route_id` from each group entry as the anchor, and add `id={"row-" + model_route_id}` on the leaderboard row).
When both arrays are empty, hide the panel entirely. When `comparability_summary.groups_with_cross_party_check === 0` (the common state), surface a small note: "No third-party reports available for cross-party comparison."
### 4.5 Per-eval header chips
On the eval-detail page header (next to existing "Measures" / "Source dataset" chips around [components/eval-detail.tsx:486-525](../components/eval-detail.tsx#L486-L525)), add a fourth chip when `evalcards.annotations.reporting_completeness` is present:
> **Documentation**
> {round(completeness_score * 100)}%
Tooltip: "{N} of {M} EvalCards documentation fields populated for this benchmark."
### 4.6 Per-model card chips
On `components/eval-card.tsx` and the model card pages, add three chips driven by the model-level summaries. Replace the hand-written hint at [components/eval-card.tsx:250](../components/eval-card.tsx#L250) ("Some results lack generation settings; compare scores with care.") with a data-driven version:
> {has_reproducibility_gap_count} of {results_total} reported scores aren't fully documented.
Show only when `has_reproducibility_gap_count > 0`. The hand-written hint was a placeholder for exactly this signal, so wire it up.
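
A minimal sketch of the data-driven hint. `gapHint` is a hypothetical helper; it returns `null` when nothing should render, which also covers older snapshots where the summary is absent.

```typescript
interface ReproSummarySketch {
  results_total: number
  has_reproducibility_gap_count: number
}

// Returns the hint text, or null when the summary is missing or reports no gaps.
function gapHint(summary?: ReproSummarySketch | null): string | null {
  if (!summary || summary.has_reproducibility_gap_count <= 0) return null
  return `${summary.has_reproducibility_gap_count} of ${summary.results_total} reported scores aren't fully documented.`
}
```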
---
## 5. New page: corpus dashboard
Add `app/corpus/page.tsx` (linked from main navigation [components/navigation.tsx](../components/navigation.tsx)). Server component that calls `fetchCorpusAggregates()` and renders four sections:
### 5.1 Reproducibility section
- Headline number: `reproducibility_gap_rate` rendered as percentage. Sub-label: "{triples_with_reproducibility_gap} of {total_triples} reported scores."
- Per-field horizontal bar chart from `per_field_missingness`. **Bar denominator depends on `denominator` field**: agentic-only fields use `agentic_triples`, others use `total_triples`. Label each bar with the denominator type so users understand.
- Toggle: `overall` ↔ `by_category` (rendered as a small-multiple grid, one panel per category).
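
The per-field bars can be prepared from `per_field_missingness` as sketched below. `missingnessBars` is a hypothetical helper and the caption wording ("agentic scores" / "reported scores") is an assumption; the point is that each bar reads its own `denominator_count` instead of assuming `total_triples`.

```typescript
interface FieldMissingnessSketch {
  missing_count: number
  missing_rate: number | null
  denominator: "all_triples" | "agentic_only"
  denominator_count: number
}

interface MissingnessBarSketch {
  field: string
  rate: number | null // null when the denominator was empty; render as an empty state
  caption: string
}

// One bar per field, captioned with the denominator it was computed against.
function missingnessBars(perField: Record<string, FieldMissingnessSketch>): MissingnessBarSketch[] {
  return Object.keys(perField).map((field) => {
    const m = perField[field]
    const unit = m.denominator === "agentic_only" ? "agentic scores" : "reported scores"
    return {
      field,
      rate: m.missing_rate,
      caption: `${m.missing_count} of ${m.denominator_count} ${unit}`,
    }
  })
}
```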
### 5.2 Completeness section
- Headline: `completeness_score_mean` (and median) across `total_benchmarks`.
- Histogram of per-benchmark scores (pull individual benchmark scores from `eval-list.json` `reporting_completeness.completeness_score`, since corpus-aggregates only carries mean/median).
- Per-field bar chart from `per_field_population`, with three bars per field: `mean_score`, `populated_rate`, `fully_populated_rate`. (See §6.7 for which one to highlight per coverage type.)
### 5.3 Provenance section
- Stacked bar of `source_type_distribution` (across all triples).
- Two ratios: `multi_source_rate`, `first_party_only_rate`. Label both: "% of (model, benchmark, metric) groups."
### 5.4 Comparability section
- Two side-by-side panels: Variant divergence (eligible-aware rate) and Cross-party divergence (often null).
- **When `cross_party_divergence_rate === null`:** show a "Not enough multi-org coverage to compute" empty state, not "0%". Same for `variant_divergence_rate === null`. This is critical; see §6.5.
All sections support a category toggle (research mode shows category breakdowns by default; policy mode shows overall by default).
---
## 6. Caveats and edge cases (read these before implementing)
### 6.1 `first_party_only` semantics
A row can be `first_party_only: true` even when `is_multi_source: false`. The spec literal: a group with one *named* org reporting first-party gets the badge. **Don't read it as "exclusive coverage"**; read it as "no independent replication." The label suggestion is "First-party only" rather than "Sole source."
If `distinct_reporting_organizations === 0` (all rows have null org), `first_party_only` is `false` even when `source_type === "first_party"`. Render the row's source as "First-party (org unspecified)" in research mode; suppress the first-party-only badge.
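
These two rules can be captured in a small decision helper. This is a sketch: `firstPartyDisplay` and its return shape are hypothetical, and the explicit `orgUnspecified` guard is defensive, since the backend already reports `first_party_only: false` in that case.

```typescript
interface ProvenanceSketch {
  source_type: "first_party" | "third_party" | "collaborative" | "unspecified"
  first_party_only: boolean
  distinct_reporting_organizations: number
}

interface FirstPartyDisplaySketch {
  showFirstPartyOnly: boolean // render the "First-party only" indicator
  orgUnspecified: boolean // render "First-party (org unspecified)" in research mode
}

function firstPartyDisplay(p: ProvenanceSketch): FirstPartyDisplaySketch {
  const isFirstParty = p.source_type === "first_party"
  const orgUnspecified = isFirstParty && p.distinct_reporting_organizations === 0
  return {
    showFirstPartyOnly: isFirstParty && p.first_party_only && !orgUnspecified,
    orgUnspecified,
  }
}
```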
### 6.2 Active reproducibility field set is reduced
The spec describes four base fields (`temperature`, `top_p`, `max_tokens`, `prompt_template`); the active backend currently checks **only `temperature` and `max_tokens`** plus `eval_plan` / `eval_limits` for agentic benchmarks. **Don't hardcode "4 fields" anywhere.** Always read `required_field_count` off the annotation. This is a deliberate spec-author choice and may revert; the field count is the only stable interface.
### 6.3 Missing-field path strings
`missing_fields` for reproducibility uses bare names (e.g. `"max_tokens"`). `missing_required_fields` for completeness uses dotted paths (e.g. `"autobenchmarkcard.methodology.baseline_results"`). Different conventions, intentional. Build a small label map for completeness paths; the paths come from [registry/completeness_fields.json](https://github.com/evaleval/eval_cards_backend_pipeline/blob/main/registry/completeness_fields.json) on the backend repo. Suggested label rules:
- Drop the `autobenchmarkcard.` / `eee_eval.` / `evalcards.` prefix.
- Replace dots with " / " and underscores with spaces, then capitalize the first letter of each segment.
- Example: `autobenchmarkcard.methodology.baseline_results` → "Methodology / Baseline results".
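
The label rules above can be sketched as a fallback formatter for any path the hand-written label map does not cover. `friendlyFieldLabel` is a hypothetical name; it capitalizes only the first letter of each segment, which is what the example output shows.

```typescript
// Hypothetical fallback formatter for completeness field paths.
function friendlyFieldLabel(path: string): string {
  return path
    .replace(/^(autobenchmarkcard|eee_eval|evalcards)\./, "") // drop the known prefix
    .split(".")
    .map((segment) => {
      const words = segment.replace(/_/g, " ")
      return words.charAt(0).toUpperCase() + words.slice(1)
    })
    .join(" / ")
}

// friendlyFieldLabel("autobenchmarkcard.methodology.baseline_results")
// returns "Methodology / Baseline results"
```

Paths without a known prefix pass through the same formatting unchanged, so new registry entries still get a readable label before anyone adds them to the map.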
### 6.4 `differing_setup_fields[].values` may contain null and mixed types
Per spec §6.1.4, `null` is a *distinct* value from any explicit setting (comparing "explicit 2048" to "unspecified" is meaningful). Render `null` as "(unspecified)" rather than the string "null". Numeric, string, boolean, and object values can all appear in the same array; render with `JSON.stringify` for objects, plain text otherwise.
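
A minimal sketch of that rendering rule. `formatSetupValue` is a hypothetical helper; treating `undefined` the same as `null` is an assumption added for safety, not something the backend is documented to emit.

```typescript
// Renders one entry of differing_setup_fields[].values as display text.
function formatSetupValue(value: unknown): string {
  if (value === null || value === undefined) return "(unspecified)"
  if (typeof value === "object") return JSON.stringify(value) // objects and arrays
  return String(value) // numbers, strings, booleans
}

// Example: [2048, null, true].map(formatSetupValue)
// gives ["2048", "(unspecified)", "true"]
```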
### 6.5 `null` rates in comparability are *not* zero
Eligibility-aware denominators mean `variant_divergence_rate` and `cross_party_divergence_rate` are `null` when no groups were eligible. **Render as "N/A (not enough data)" or an empty-state card, never as "0%".** On the current corpus, `cross_party_divergence_rate` will commonly be null (third-party reports are sparse). Treat this as a normal state, not a data-loading error.
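
The distinction matters most in shared formatting helpers, where `rate ?? 0` or a truthiness check would silently turn `null` into "0%". A sketch with a hypothetical `formatRate` helper:

```typescript
// Formats a nullable rate; null means "no eligible groups", not zero.
function formatRate(rate: number | null): string {
  if (rate === null) return "N/A (not enough data)"
  return `${Math.round(rate * 100)}%`
}

// formatRate(null) returns "N/A (not enough data)"
// formatRate(0) returns "0%" (groups were eligible and none diverged)
```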
### 6.6 Score-scale anomaly flag
`variant_divergence.score_scale_anomaly === true` indicates the metric was declared `proportion` but scores fell outside [0, 1], which usually points to a metric-normalization bug upstream. Surface as a small "data quality warning" annotation alongside the divergence number; the divergence is still computed but the threshold may not be apples-to-apples.
### 6.7 `mean_score` vs `populated_rate` for completeness
Per-field aggregates expose three numbers. Pick which to display based on `coverage_type`:
- **`full` and `reserved` fields**: `mean_score` and `populated_rate` are equal. Show one number labeled "% of benchmarks populating this field."
- **`partial` fields**: they diverge. `populated_rate` = % of benchmarks with *any* sub-item; `mean_score` = average sub-item population fraction. Show both: "{populated_rate}% have any data, {mean_score}% on average across sub-items."
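
A sketch of that choice as a formatter. `describeFieldPopulation` is a hypothetical helper. Note that `per_field_population` entries do not carry `coverage_type` themselves, so this assumes the caller looks it up elsewhere (for example from a benchmark's `field_scores`, or from the backend registry).

```typescript
type CoverageTypeSketch = "full" | "partial" | "reserved"

interface FieldPopulationSketch {
  mean_score: number
  populated_rate: number
  fully_populated_rate: number
  benchmark_count: number
}

// Picks the headline wording for one field, based on its coverage type.
function describeFieldPopulation(coverage: CoverageTypeSketch, agg: FieldPopulationSketch): string {
  const pct = (x: number) => Math.round(x * 100)
  if (coverage === "partial") {
    return `${pct(agg.populated_rate)}% have any data, ${pct(agg.mean_score)}% on average across sub-items`
  }
  // full and reserved: mean_score equals populated_rate, so one number suffices
  return `${pct(agg.populated_rate)}% of benchmarks populating this field`
}
```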
### 6.8 No `computed_at` on per-record annotations
Only `signal_version` is on each annotation. For "last computed" UI text, use `manifest.json → generated_at` from the existing `BackendManifest`.
### 6.9 Stratification categories
`by_category` keys are: `agentic`, `general`, `knowledge`, `reasoning`, `safety`, `other`. This is the same set as the existing `category` field on evals, so reuse whatever color scheme is currently keyed off `inferCategoryFromBenchmark` ([lib/benchmark-schema.ts](../lib/benchmark-schema.ts)).
### 6.10 Annotation block can be `null` or absent
`evalcards.annotations.{reproducibility_gap,provenance,variant_divergence,cross_party_divergence}` can each be `null` independently, and the entire `evalcards` block may be absent on older cached snapshots. Use optional chaining everywhere; never assume presence. The `RowAnnotations` type intentionally types each subfield as `T | null` (not `T | undefined`) because the backend writes explicit `null`.
---
## 7. Suggested implementation order
1. **Types + plumbing** (1–2 hours): types in `backend-artifacts.ts` + `hf-data.ts`, the `fetchCorpusAggregates` fetcher, the API route, and adding `corpus-aggregates.json` to the cache script. No UI yet.
2. **Row-level badges** (½ day): build `signals/` directory with the four badge components, the dedup-aware `signals-row-badges.tsx`, and wire into eval-detail and benchmark-detail. This is the most visible win.
3. **Per-eval completeness panel + comparability panel** (½ day): single benchmark, easy to design around. New `CompletenessPanel` is the headline new UX in this set.
4. **Per-row reproducibility detail panel** (1–2 hours): drops into the existing expanded row layout.
5. **Per-eval / per-model header chips + replace the hand-written gap hint** (1–2 hours): wires the summary fields into existing card surfaces.
6. **Corpus dashboard page** (1–2 days): new route, new components, biggest scope. Defer until steps 1–5 are live and reviewed.
Each step is independently shippable. Steps 1–5 can land before the corpus dashboard is designed.
---
## 8. Out of scope (don't do these yet)
- **Filter / sort the eval list by signal state** ("show only benchmarks with completeness > 0.5"). Wait for the dashboard view to land first; users will tell us which filters they actually want.
- **Side-by-side score comparison with divergence overlay.** The data supports it (`scores_in_group`, `scores_by_organization`) but the design space is large. Hold off until we see the row-level badges in use.
- **Recompute / verification UI for missing reproducibility fields.** Backend-side; out of scope here.
- **Per-instance sample-level badges.** Signals operate at row / benchmark level; sample-level instance data is unaffected.
---
## 9. Reference: minimal real-shape examples
Per-row `evalcards.annotations` with all four signals populated:
```jsonc
{
  "reproducibility_gap": {
    "has_reproducibility_gap": true,
    "missing_fields": ["max_tokens"],
    "required_field_count": 2,
    "populated_field_count": 1,
    "signal_version": "1.0"
  },
  "provenance": {
    "source_type": "first_party",
    "is_multi_source": false,
    "first_party_only": true,
    "distinct_reporting_organizations": 1,
    "signal_version": "1.0"
  },
  "variant_divergence": null,
  "cross_party_divergence": null
}
```
Per-eval `evalcards.annotations` with completeness + comparability:
```jsonc
{
  "reporting_completeness": {
    "completeness_score": 0.62,
    "total_fields_evaluated": 28,
    "missing_required_fields": [
      "autobenchmarkcard.methodology.baseline_results",
      "autobenchmarkcard.methodology.validation",
      "evalcards.preregistration_url"
    ],
    "partial_fields": [
      { "field_path": "autobenchmarkcard.data", "score": 0.5, "populated_subitems": 2, "total_subitems": 4 }
    ],
    "field_scores": [/* 28 entries */],
    "signal_version": "1.0"
  },
  "benchmark_comparability": {
    "variant_divergence_groups": [
      {
        "group_id": "openai__gpt-5__hfopenllm_v2_bbh_accuracy",
        "model_route_id": "openai__gpt-5",
        "divergence_magnitude": 0.12,
        "threshold_used": 0.05,
        "threshold_basis": "proportion_or_continuous_normalized",
        "differing_setup_fields": [
          { "field": "max_tokens", "values": [2048, 4096, 8192] }
        ]
      }
    ],
    "cross_party_divergence_groups": []
  }
}
```
Top-level `provenance_summary` example:
```jsonc
{
  "total_results": 142,
  "total_groups": 47,
  "multi_source_groups": 3,
  "first_party_only_groups": 30,
  "source_type_distribution": {
    "first_party": 120,
    "third_party": 18,
    "collaborative": 0,
    "unspecified": 4
  }
}
```
`corpus-aggregates.json` structure (top of file):
```jsonc
{
  "generated_at": "2026-04-27T...",
  "signal_version": "1.0",
  "stratification_dimensions": ["category"],
  "reproducibility": { "overall": {/* ReproducibilityCorpusBlock */}, "by_category": { "agentic": {...}, "general": {...}, ... } },
  "completeness": { "overall": {/* CompletenessCorpusBlock */}, "by_category": {...} },
  "provenance": { "overall": {/* ProvenanceCorpusBlock */}, "by_category": {...} },
  "comparability": { "overall": {/* ComparabilityCorpusBlock */}, "by_category": {...} }
}
```
---
## 10. Audience-mode wording cheatsheet
| Element | Research mode | Policy mode |
|---|---|---|
| Reproducibility gap badge | "Reproducibility gap" | "Setup not documented" |
| Reproducibility tooltip | "Setup not fully documented. Missing: {fields}." | "This score's setup isn't fully documented, so it can't be re-run as-is." |
| Reproducibility panel title | "Reproducibility" | "Re-runnability" |
| Completeness chip label | "Documentation" | "Documentation" |
| Completeness panel title | "Reporting completeness" | "How well is this benchmark documented?" |
| Provenance: first-party | "1st party" | "Reported by model developer" |
| Provenance: first-party only | "1st party only (no replication)" | "Only the model developer reported this score" |
| Provenance: third-party | "3rd party" | "Independently reported" |
| Provenance: collaborative | "Collaborative" | "Joint report" |
| Variant divergence badge | "Variant divergence" | "Score depends on setup" |
| Variant divergence tooltip | "Scores diverge by {magnitude} across different setups: {fields}." | "Different runs of this evaluation produced different scores; the setup matters." |
| Cross-party divergence badge | "Cross-party divergence" | "Sources disagree" |
| Cross-party divergence tooltip | "Reports diverge by {magnitude} across organizations." | "Different organizations reported different scores for this same model on this same benchmark." |
Adjust tone but keep the underlying numbers identical across modes: the data is the same, only the framing changes.
---
*Last updated 2026-04-27. Maintainer: backend pipeline (eval_cards_backend_pipeline), frontend (general-eval-card). Questions on backend semantics → [eval_cards_backend_pipeline#2](https://github.com/evaleval/eval_cards_backend_pipeline/issues/2). Questions on UX → discuss with @anka-evals + frontend team.*