File size: 32,635 Bytes
bca888a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
# EvalCards interpretive signals β€” frontend implementation spec

**Status:** ready to implement. Backend ships in `evaleval/eval_cards_backend_pipeline` PR #1 (merged `b05323c`). All field shapes below are stable and covered by the backend's test suite.

**Companion docs:**
- Spec source of truth: *EvalCards Interpretive Signals v1.0* (Anka Reuel, Stanford). Section refs (Β§3, Β§4, …) below point at that doc.
- Open backend questions: [evaleval/eval_cards_backend_pipeline#2](https://github.com/evaleval/eval_cards_backend_pipeline/issues/2). None block frontend work β€” they may shift wording, not shape.

---

## 0. What this PR does at a glance

The backend now annotates evaluation records with four interpretive signals:

1. **Reproducibility gap** β€” *per row.* Was the evaluation documented well enough to be re-run? Surfaced as a missing-fields list (e.g. "missing `max_tokens`").
2. **Reporting completeness** β€” *per benchmark.* What fraction of EvalCards-required documentation fields are populated? Surfaced as a `[0, 1]` score with a missing-field breakdown.
3. **Provenance** β€” *per row.* Who reported this score (first-party / third-party / collaborative / unspecified), and is it the only source for this `(model, benchmark, metric)` group?
4. **Comparability** β€” *per `(model, benchmark, metric)` group.* Two flavors: **variant divergence** (same model, same benchmark, different setups β†’ diverging scores) and **cross-party divergence** (different orgs reporting β†’ diverging scores).

Plus a corpus-level rollup file (`corpus-aggregates.json`) for a stratified analytics page.

The frontend's job: surface these signals **in three places** β€” row-level badges, per-eval / per-model summary panels, and a corpus dashboard view.

---

## 1. Where the new data lives

All fields are new additions to existing artifacts. No artifact is removed or reshaped.

| Artifact | New fields |
|---|---|
| `evals/{id}.json` (`HFEvalDetail`) | Per-row `evalcards.annotations` block on every `metrics[].model_results[]` and `subtasks[…].metrics[].model_results[]`. Plus eval-root `evalcards.annotations.reporting_completeness`, `evalcards.annotations.benchmark_comparability`, and three top-level summaries: `reproducibility_summary`, `provenance_summary`, `comparability_summary`. |
| `models/{id}.json` (`HFModelDetail`) | Per-row `evalcards.annotations` block on every `hierarchy_by_category[*][*].metrics[].model_results[]`. Plus three top-level summaries scoped to that model. |
| `eval-list.json` / `eval-list-lite.json` (`HFEvalListEntry`) | Three summaries per entry. |
| `model-cards.json` / `model-cards-lite.json` (`HFModelCardEntry`) | Three summaries per entry. |
| `eval-hierarchy.json` (`EvalHierarchy`) | Each family node and leaf node carries the three summaries (aggregated over evals under it). |
| **`corpus-aggregates.json` (NEW FILE)** | Stratified rollups for paper / dashboard use. |
| `manifest.json` | New entry in `summary_artifacts`: `corpus_aggregates: "corpus-aggregates.json"`. |

`signal_version` (currently `"1.0"`) is present on every annotation. Treat it as opaque; surface only in admin/debug.

---

## 2. TypeScript types to add

Add to `lib/backend-artifacts.ts` (preferred β€” these are pipeline contract types):

```ts
// Spec Β§3
export interface ReproducibilityGap {
  has_reproducibility_gap: boolean
  missing_fields: string[]              // e.g. ["max_tokens"]
  required_field_count: number          // 2 base + 2 if agentic on current runtime
  populated_field_count: number
  signal_version: string
}

// Spec Β§5
export type ProvenanceSourceType =
  | "first_party"
  | "third_party"
  | "collaborative"
  | "unspecified"

export interface Provenance {
  source_type: ProvenanceSourceType
  is_multi_source: boolean
  first_party_only: boolean             // see Β§6.1 below for caveat
  distinct_reporting_organizations: number
  signal_version: string
}

// Spec Β§6.1
export interface VariantDivergence {
  has_variant_divergence: boolean
  group_id: string                      // "{model_route_id}__{metric_summary_id}"
  divergence_magnitude: number
  threshold_used: number
  threshold_basis:
    | "proportion_or_continuous_normalized"
    | "percent"
    | "range_5pct"
    | "fallback_default"
  differing_setup_fields: Array<{ field: string; values: unknown[] }>
  scores_in_group: number[]
  this_triple_score: number | null      // this row's score within the group
  triple_count_in_group: number
  score_scale_anomaly: boolean
  group_variant_breakdown: Array<{ variant_key: string; row_count: number }>
  signal_version: string
}

// Spec Β§6.2
export interface CrossPartyDivergence {
  has_cross_party_divergence: boolean
  group_id: string
  divergence_magnitude: number
  threshold_used: number
  threshold_basis: VariantDivergence["threshold_basis"]
  scores_by_organization: Record<string, number>   // display org name β†’ score
  differing_setup_fields: Array<{ field: string; values: unknown[] }>
  organization_count: number
  group_variant_breakdown: Array<{ variant_key: string; row_count: number }>
  signal_version: string
}

// Per-row annotation block (carried on every model_result row)
export interface RowAnnotations {
  reproducibility_gap: ReproducibilityGap | null
  provenance: Provenance | null
  variant_divergence: VariantDivergence | null
  cross_party_divergence: CrossPartyDivergence | null
}

// Spec Β§4
export interface ReportingCompleteness {
  completeness_score: number            // [0, 1]
  total_fields_evaluated: number
  missing_required_fields: string[]     // dotted paths
  partial_fields: Array<{
    field_path: string
    score: number                       // (0, 1) β€” strictly between
    populated_subitems: number
    total_subitems: number
  }>
  field_scores: Array<{
    field_path: string
    coverage_type: "full" | "partial" | "reserved"
    score: number                       // [0, 1]
  }>
  signal_version: string
}

export interface BenchmarkComparability {
  variant_divergence_groups: Array<{
    group_id: string
    model_route_id: string
    divergence_magnitude: number
    threshold_used: number
    threshold_basis: VariantDivergence["threshold_basis"]
    differing_setup_fields: VariantDivergence["differing_setup_fields"]
  }>
  cross_party_divergence_groups: Array<{
    group_id: string
    model_route_id: string
    divergence_magnitude: number
    threshold_used: number
    threshold_basis: VariantDivergence["threshold_basis"]
    scores_by_organization: Record<string, number>
    differing_setup_fields: VariantDivergence["differing_setup_fields"]
  }>
}

// Eval-root or model-root annotation block
export interface EvalcardsAnnotations {
  reporting_completeness?: ReportingCompleteness
  benchmark_comparability?: BenchmarkComparability
}

// Top-level summary blocks (present on eval-list / model-cards / eval / model / hierarchy nodes)
export interface ReproducibilitySummary {
  results_total: number
  has_reproducibility_gap_count: number
  populated_ratio_avg: number | null    // null when results_total == 0
}

export interface ProvenanceSummary {
  total_results: number
  total_groups: number
  multi_source_groups: number
  first_party_only_groups: number
  source_type_distribution: Record<ProvenanceSourceType, number>
}

export interface ComparabilitySummary {
  total_groups: number
  groups_with_variant_check: number     // eligible groups (>=2 rows, differing setups, >=2 scored)
  groups_with_cross_party_check: number // eligible groups (>=2 named orgs)
  variant_divergent_count: number
  cross_party_divergent_count: number
}

export interface SignalSummaries {
  reproducibility_summary?: ReproducibilitySummary
  provenance_summary?: ProvenanceSummary
  comparability_summary?: ComparabilitySummary
}

// corpus-aggregates.json
export interface CorpusAggregates {
  generated_at: string
  signal_version: string
  stratification_dimensions: ["category"]
  reproducibility: Stratified<ReproducibilityCorpusBlock>
  completeness:   Stratified<CompletenessCorpusBlock>
  provenance:     Stratified<ProvenanceCorpusBlock>
  comparability:  Stratified<ComparabilityCorpusBlock>
}

export interface Stratified<T> {
  overall: T
  by_category: Record<string, T>        // categories: agentic | general | knowledge | reasoning | safety | other
}

export interface ReproducibilityCorpusBlock {
  total_triples: number
  triples_with_reproducibility_gap: number
  reproducibility_gap_rate: number | null
  agentic_triples: number
  per_field_missingness: Record<string, {
    missing_count: number
    missing_rate: number | null
    denominator: "all_triples" | "agentic_only"
    denominator_count: number
  }>
}

export interface CompletenessCorpusBlock {
  total_benchmarks: number
  completeness_score_mean: number | null
  completeness_score_median: number | null
  per_field_population: Record<string, {
    mean_score: number
    populated_rate: number
    fully_populated_rate: number
    benchmark_count: number
  }>
}

export interface ProvenanceCorpusBlock {
  total_triples: number
  total_groups: number
  multi_source_groups: number
  multi_source_rate: number | null
  first_party_only_groups: number
  first_party_only_rate: number | null
  source_type_distribution: Record<ProvenanceSourceType, number>
}

export interface ComparabilityCorpusBlock {
  total_groups: number
  variant_eligible_groups: number
  variant_divergent_groups: number
  variant_divergence_rate: number | null
  cross_party_eligible_groups: number
  cross_party_divergent_groups: number
  cross_party_divergence_rate: number | null   // commonly null on current corpus
}
```

Then in `lib/hf-data.ts`:

- Extend `HFEvalModelResult` (line ~522) with `evalcards?: { annotations?: RowAnnotations }`.
- Extend `HFEvalDetail` (line ~556) with `evalcards?: { annotations?: EvalcardsAnnotations }` plus the three summary fields from `SignalSummaries`.
- Extend `HFEvalListEntry` (line ~475) with `SignalSummaries` fields.
- Extend `HFModelCardEntry` (line ~439) with `SignalSummaries` fields.
- Extend `HFModelDetail` (line ~571) with `SignalSummaries` fields.
- Extend `HFModelHierarchyMetric` (line ~616) β€” `model_results` already typed as `HFEvalModelResult`, so the per-row annotations propagate automatically.

In `EvalHierarchy` types (`lib/backend-artifacts.ts` line ~54), add `SignalSummaries` to both `HierarchyFamily` and `HierarchyBenchmark`.

All fields are **optional** at the type level β€” older cached snapshots won't have them, and the frontend should render gracefully when they're absent.

---

## 3. Data plumbing

### 3.1 New fetcher + API route for corpus aggregates

In `lib/hf-data.ts`, add after the existing fetchers (~line 866):

```ts
export async function fetchCorpusAggregates(): Promise<CorpusAggregates | null> {
  return fetchHFJsonSafe<CorpusAggregates>("corpus-aggregates.json")
}
```

Add to `scripts/cache-hf-data.mjs` `CACHE_ROOT_FILES` array: `"corpus-aggregates.json"`. (Mark it optional in `OPTIONAL_CACHE_ROOT_FILES` if shipping while the HF dataset upload is still rolling β€” once the backend pipeline next runs against the dataset, the file will appear.)

Create `app/api/corpus-aggregates/route.ts`:

```ts
import { NextResponse } from "next/server"
import { fetchCorpusAggregates } from "@/lib/hf-data"

export async function GET() {
  const aggregates = await fetchCorpusAggregates()
  if (!aggregates) {
    return NextResponse.json({ error: "Corpus aggregates not available" }, { status: 404 })
  }
  return NextResponse.json(aggregates)
}
```

### 3.2 Rest of plumbing is automatic

Existing fetchers (`fetchEvalDetail`, `fetchModelDetail`, `fetchEvalList`, `fetchModelCardsList`, `fetchEvalHierarchy`) just pull the raw JSON, so the new fields propagate without code changes once the types above are widened.

---

## 4. UX components to build

Build a small set of reusable signal components in `components/signals/`. Each takes one of the typed shapes above and renders a badge / panel. This keeps signal rendering consistent across `eval-detail.tsx`, `benchmark-detail.tsx`, `model-compare-dialog.tsx`, and the new corpus dashboard.

```
components/signals/
β”œβ”€β”€ reproducibility-badge.tsx
β”œβ”€β”€ provenance-badge.tsx          // already partially exists in benchmark-detail.tsx β€” see Β§4.2
β”œβ”€β”€ variant-divergence-badge.tsx
β”œβ”€β”€ cross-party-divergence-badge.tsx
β”œβ”€β”€ reproducibility-panel.tsx      // detail view β€” full missing-fields list
β”œβ”€β”€ completeness-panel.tsx         // detail view β€” score bar + missing-field list
β”œβ”€β”€ comparability-panel.tsx        // detail view β€” divergent groups list
β”œβ”€β”€ signals-row-badges.tsx         // composite: renders all four row-level badges with proper spacing
└── signal-tooltip.tsx             // shared tooltip primitive
```

All badges should follow the existing tone conventions used by `getRelationshipBadgeTone` ([components/benchmark-detail.tsx:289](../components/benchmark-detail.tsx#L289)) and the `Badge` primitive in [components/ui/badge.tsx](../components/ui/badge.tsx).

### 4.1 Row-level badges β€” placement

Insert `<SignalsRowBadges annotations={modelResult.evalcards?.annotations} />` next to the score cell in:

- **Eval detail leaderboard table** β€” [components/eval-detail.tsx:869-871](../components/eval-detail.tsx#L869-L871) (the `<TableCell className="text-right">` containing the score). Render badges below the score on a new line for desktop, hidden on mobile.
- **Benchmark detail rows** β€” `components/benchmark-detail.tsx` renders score rows in several places (search for `formatRawScoreValue`); insert the same component.
- **Model compare dialog** β€” [components/model-compare-dialog.tsx](../components/model-compare-dialog.tsx) score columns.

**Display rules β€” only badge for actionable states.** Silence is meaningful here.

| Signal | Show badge when | Hide when |
|---|---|---|
| Reproducibility | `has_reproducibility_gap === true` | gap=false, or annotation absent |
| Provenance | `source_type` ∈ {`first_party`, `third_party`, `collaborative`} | `source_type === "unspecified"` |
| Variant divergence | `variant_divergence !== null && has_variant_divergence === true` | null (not applicable) or false (checked, fine) |
| Cross-party divergence | `cross_party_divergence !== null && has_cross_party_divergence === true` | null (almost always on current corpus) or false |

`has_*: false` means "we checked and it's fine" β€” silent success. `null` means "not applicable / not enough data" β€” also silent. **Only divergent / gap-positive states warrant pixels.**

**Dedup rule.** `variant_divergence` and `cross_party_divergence` are duplicated onto every row in the same group. If you render three rows from the same `group_id`, render the divergence badge on each row but the *expanded panel* (Β§4.4) only once at the group header.

### 4.2 Provenance badge β€” reuse what's there

[components/benchmark-detail.tsx:262-302](../components/benchmark-detail.tsx#L262-L302) already has `getRelationshipShortLabel` and `getRelationshipBadgeTone`. Extract these into `components/signals/provenance-badge.tsx` and import back into `benchmark-detail.tsx`. The new badge should **also** consume the new `Provenance` annotation when present (it carries `is_multi_source` and `first_party_only`, which the current implementation derives row-by-row from `source_metadata` alone).

When `provenance.first_party_only === true`, show a small ⚠ subtle indicator on the first-party badge ("first-party only β€” no independent replication"). This is the headline use of the signal for policy-mode readers.

### 4.3 Reproducibility badge β€” content rules

Tooltip content depends on audience mode (`useAudienceMode()` from [components/audience-mode-provider.tsx:40](../components/audience-mode-provider.tsx#L40)):

- Research mode: "Setup not fully documented. Missing: `max_tokens`, `eval_plan`."
- Policy mode: "This score's setup isn't fully documented, so it can't be re-run as-is."

Always include the count "{populated_field_count} of {required_field_count} setup fields recorded." Don't hardcode "4 fields" β€” the active runtime checks 2 base fields (`temperature`, `max_tokens`) plus 2 agentic fields (`eval_plan`, `eval_limits`) when the benchmark is agentic. Read counts off the annotation.

### 4.4 Detail panels β€” placement

#### Reproducibility panel
The existing "Evaluation Provenance" panel in [components/eval-detail.tsx:952-998](../components/eval-detail.tsx#L952-L998) (rendered when a row is expanded) is the right place for the **per-row** reproducibility breakdown. Add a new `DetailPanel` adjacent to it:

```tsx
{rowAnnotations?.reproducibility_gap && (
  <DetailPanel
    title={isResearchView ? "Reproducibility" : "Re-runnability"}
    subtitle={
      isResearchView
        ? "Whether the setup is documented well enough for someone else to re-run."
        : "Whether someone could re-run this evaluation with the information available."
    }
  >
    <MetaRow
      label="Setup fields recorded"
      value={`${rowAnnotations.reproducibility_gap.populated_field_count} of ${rowAnnotations.reproducibility_gap.required_field_count}`}
    />
    {rowAnnotations.reproducibility_gap.missing_fields.length > 0 && (
      <MetaRow
        label="Missing"
        value={rowAnnotations.reproducibility_gap.missing_fields.join(", ")}
      />
    )}
  </DetailPanel>
)}
```

#### Completeness panel
Render at the **eval-detail header level** (above the leaderboard, below the metric specification card). New `<CompletenessPanel completeness={detail.evalcards?.annotations?.reporting_completeness} />`. UI: progress bar showing `completeness_score`, label "{N} of {M} fields populated" where N = sum of `field_scores[].score` rounded, M = `total_fields_evaluated`. Below: collapsible accordions:

- **Missing required fields** (count badge) β€” list of `missing_required_fields` with friendly labels (see Β§6.4 for label mapping).
- **Partially populated** (count badge) β€” `partial_fields` rendered as "{field}: {populated_subitems}/{total_subitems}".

In policy mode, don't show the dotted-path field names β€” show friendly labels only. In research mode, show both.

#### Comparability panel
Also at eval-detail header level. Sourced from `detail.evalcards?.annotations?.benchmark_comparability`. Render as two collapsibles β€” "Variant divergence ({count})" and "Cross-party divergence ({count})". Each item should link to the relevant model row (use `model_route_id` from each group entry as anchor β€” add `id={"row-" + model_route_id}` on the leaderboard row).

When both arrays are empty, hide the panel entirely. When `comparability_summary.groups_with_cross_party_check === 0` (the common state), surface a small note: "No third-party reports available for cross-party comparison."

### 4.5 Per-eval header chips
On the eval-detail page header (next to existing "Measures" / "Source dataset" chips around [components/eval-detail.tsx:486-525](../components/eval-detail.tsx#L486-L525)), add a fourth chip when `evalcards.annotations.reporting_completeness` is present:

> **Documentation**
> {round(completeness_score * 100)}%

Tooltip: "{N} of {M} EvalCards documentation fields populated for this benchmark."

### 4.6 Per-model card chips
On `components/eval-card.tsx` and the model card pages, add three chips driven by the model-level summaries. Replace the hand-written hint at [components/eval-card.tsx:250](../components/eval-card.tsx#L250) ("Some results lack generation settings; compare scores with care.") with a data-driven version:

> {has_reproducibility_gap_count} of {results_total} reported scores aren't fully documented.

Show only when `has_reproducibility_gap_count > 0`. The hand-written hint was a placeholder for exactly this signal β€” wire it up.

---

## 5. New page: corpus dashboard

Add `app/corpus/page.tsx` (linked from main navigation [components/navigation.tsx](../components/navigation.tsx)). Server component that calls `fetchCorpusAggregates()` and renders four sections:

### 5.1 Reproducibility section
- Headline number: `reproducibility_gap_rate` rendered as percentage. Sub-label: "{triples_with_reproducibility_gap} of {total_triples} reported scores."
- Per-field horizontal bar chart from `per_field_missingness`. **Bar denominator depends on `denominator` field**: agentic-only fields use `agentic_triples`, others use `total_triples`. Label each bar with the denominator type so users understand.
- Toggle: `overall` ↔ `by_category` (rendered as a small-multiple grid, one panel per category).

### 5.2 Completeness section
- Headline: `completeness_score_mean` (and median) across `total_benchmarks`.
- Histogram of per-benchmark scores (pull individual benchmark scores from `eval-list.json` `reporting_completeness.completeness_score`, since corpus-aggregates only carries mean/median).
- Per-field bar chart from `per_field_population` β€” three bars per field: `mean_score`, `populated_rate`, `fully_populated_rate`. (See Β§6.7 for which one to highlight per coverage type.)

### 5.3 Provenance section
- Stacked bar of `source_type_distribution` (across all triples).
- Two ratios: `multi_source_rate`, `first_party_only_rate`. Label both: "% of (model, benchmark, metric) groups."

### 5.4 Comparability section
- Two side-by-side panels: Variant divergence (eligible-aware rate) and Cross-party divergence (often null).
- **When `cross_party_divergence_rate === null`:** show a "Not enough multi-org coverage to compute" empty state, not "0%". Same for `variant_divergence_rate === null`. This is critical β€” see Β§6.5.

All sections support a category toggle (research mode shows category breakdowns by default; policy mode shows overall by default).

---

## 6. Caveats and edge cases (read these before implementing)

### 6.1 `first_party_only` semantics
A row can be `first_party_only: true` even when `is_multi_source: false`. The spec literal: a group with one *named* org reporting first-party gets the badge. **Don't read it as "exclusive coverage"** β€” read it as "no independent replication." The label suggestion is "First-party only" rather than "Sole source."

If `distinct_reporting_organizations === 0` (all rows have null org), `first_party_only` is `false` even when `source_type === "first_party"`. Render the row's source as "First-party (org unspecified)" in research mode; suppress the first-party-only badge.

### 6.2 Active reproducibility field set is reduced
The spec describes four base fields (`temperature`, `top_p`, `max_tokens`, `prompt_template`); the active backend currently checks **only `temperature` and `max_tokens`** plus `eval_plan` / `eval_limits` for agentic benchmarks. **Don't hardcode "4 fields" anywhere.** Always read `required_field_count` off the annotation. This is a deliberate spec-author choice and may revert; the field count is the only stable interface.

### 6.3 Missing-field path strings
`missing_fields` for reproducibility uses bare names (e.g. `"max_tokens"`). `missing_required_fields` for completeness uses dotted paths (e.g. `"autobenchmarkcard.methodology.baseline_results"`). Different conventions, intentional. Build a small label map for completeness paths β€” paths come from [registry/completeness_fields.json](https://github.com/evaleval/eval_cards_backend_pipeline/blob/main/registry/completeness_fields.json) on the backend repo. Suggested label rules:

- Drop the `autobenchmarkcard.` / `eee_eval.` / `evalcards.` prefix.
- Replace dots with " / ", underscore with space, title-case.
- Example: `autobenchmarkcard.methodology.baseline_results` β†’ "Methodology / Baseline results".

### 6.4 `differing_setup_fields[].values` may contain null and mixed types
Per spec Β§6.1.4, `null` is a *distinct* value from any explicit setting (comparing "explicit 2048" to "unspecified" is meaningful). Render `null` as "(unspecified)" rather than the string "null". Numeric, string, boolean, and object values can all appear in the same array; render with `JSON.stringify` for objects, plain text otherwise.

### 6.5 `null` rates in comparability are *not* zero
Eligibility-aware denominators mean `variant_divergence_rate` and `cross_party_divergence_rate` are `null` when no groups were eligible. **Render as "N/A β€” not enough data" or an empty-state card, never as "0%".** On the current corpus, `cross_party_divergence_rate` will commonly be null (third-party reports are sparse). Treat this as a normal state, not a data-loading error.

### 6.6 Score-scale anomaly flag
`variant_divergence.score_scale_anomaly === true` indicates the metric was declared `proportion` but scores fell outside [0, 1] β€” usually a metric-normalization bug upstream. Surface as a small "data quality warning" annotation alongside the divergence number; the divergence is still computed but the threshold may not be apples-to-apples.

### 6.7 `mean_score` vs `populated_rate` for completeness
Per-field aggregates expose three numbers. Pick which to display based on `coverage_type`:

- **`full` and `reserved` fields** β€” `mean_score` and `populated_rate` are equal. Show one number labeled "% of benchmarks populating this field."
- **`partial` fields** β€” they diverge. `populated_rate` = % of benchmarks with *any* sub-item; `mean_score` = average sub-item population fraction. Show both: "{populated_rate}% have any data, {mean_score}% on average across sub-items."

### 6.8 No `computed_at` on per-record annotations
Only `signal_version` is on each annotation. For "last computed" UI text, use `manifest.json β†’ generated_at` from the existing `BackendManifest`.

### 6.9 Stratification categories
`by_category` keys are: `agentic`, `general`, `knowledge`, `reasoning`, `safety`, `other`. Same set as the existing `category` field on evals β€” reuse whatever color scheme is currently keyed off `inferCategoryFromBenchmark` ([lib/benchmark-schema.ts](../lib/benchmark-schema.ts)).

### 6.10 Annotation block can be `null` or absent
`evalcards.annotations.{reproducibility_gap,provenance,variant_divergence,cross_party_divergence}` can each be `null` independently, and the entire `evalcards` block may be absent on older cached snapshots. Use optional chaining everywhere; never assume presence. The `RowAnnotations` type intentionally types each subfield as `T | null` (not `T | undefined`) because the backend writes explicit `null`.

---

## 7. Suggested implementation order

1. **Types + plumbing** (1–2 hours): types in `backend-artifacts.ts` + `hf-data.ts`, the `fetchCorpusAggregates` fetcher, the API route, and adding `corpus-aggregates.json` to the cache script. No UI yet.
2. **Row-level badges** (Β½ day): build `signals/` directory with the four badge components, the dedup-aware `signals-row-badges.tsx`, and wire into eval-detail and benchmark-detail. This is the most visible win.
3. **Per-eval completeness panel + comparability panel** (Β½ day): single benchmark, easy to design around. New `CompletenessPanel` is the headline new UX in this set.
4. **Per-row reproducibility detail panel** (1–2 hours): drops into the existing expanded row layout.
5. **Per-eval / per-model header chips + replace the hand-written gap hint** (1–2 hours): wires the summary fields into existing card surfaces.
6. **Corpus dashboard page** (1–2 days): new route, new components, biggest scope. Defer until 1–5 are live and reviewed.

Each step is independently shippable. Steps 1–5 can land before the corpus dashboard is designed.

---

## 8. Out of scope (don't do these yet)

- **Filter / sort the eval list by signal state** ("show only benchmarks with completeness > 0.5"). Wait for the dashboard view to land first; users will tell us which filters they actually want.
- **Side-by-side score comparison with divergence overlay.** The data supports it (`scores_in_group`, `scores_by_organization`) but the design space is large. Hold off until we see the row-level badges in use.
- **Recompute / verification UI for missing reproducibility fields.** Backend-side; out of scope here.
- **Per-instance sample-level badges.** Signals operate at row / benchmark level; sample-level instance data is unaffected.

---

## 9. Reference: minimal real-shape examples

Per-row `evalcards.annotations` with all four signals populated:

```jsonc
{
  "reproducibility_gap": {
    "has_reproducibility_gap": true,
    "missing_fields": ["max_tokens"],
    "required_field_count": 2,
    "populated_field_count": 1,
    "signal_version": "1.0"
  },
  "provenance": {
    "source_type": "first_party",
    "is_multi_source": false,
    "first_party_only": true,
    "distinct_reporting_organizations": 1,
    "signal_version": "1.0"
  },
  "variant_divergence": null,
  "cross_party_divergence": null
}
```

Per-eval `evalcards.annotations` with completeness + comparability:

```jsonc
{
  "reporting_completeness": {
    "completeness_score": 0.62,
    "total_fields_evaluated": 28,
    "missing_required_fields": [
      "autobenchmarkcard.methodology.baseline_results",
      "autobenchmarkcard.methodology.validation",
      "evalcards.preregistration_url"
    ],
    "partial_fields": [
      { "field_path": "autobenchmarkcard.data", "score": 0.5, "populated_subitems": 2, "total_subitems": 4 }
    ],
    "field_scores": [/* 28 entries */],
    "signal_version": "1.0"
  },
  "benchmark_comparability": {
    "variant_divergence_groups": [
      {
        "group_id": "openai__gpt-5__hfopenllm_v2_bbh_accuracy",
        "model_route_id": "openai__gpt-5",
        "divergence_magnitude": 0.12,
        "threshold_used": 0.05,
        "threshold_basis": "proportion_or_continuous_normalized",
        "differing_setup_fields": [
          { "field": "max_tokens", "values": [2048, 4096, 8192] }
        ]
      }
    ],
    "cross_party_divergence_groups": []
  }
}
```

Top-level `provenance_summary` example:

```jsonc
{
  "total_results": 142,
  "total_groups": 47,
  "multi_source_groups": 3,
  "first_party_only_groups": 30,
  "source_type_distribution": {
    "first_party": 120,
    "third_party": 18,
    "collaborative": 0,
    "unspecified": 4
  }
}
```

`corpus-aggregates.json` structure (top of file):

```jsonc
{
  "generated_at": "2026-04-27T...",
  "signal_version": "1.0",
  "stratification_dimensions": ["category"],
  "reproducibility": { "overall": {/* ReproducibilityCorpusBlock */}, "by_category": { "agentic": {...}, "general": {...}, ... } },
  "completeness":   { "overall": {/* CompletenessCorpusBlock */},   "by_category": {...} },
  "provenance":     { "overall": {/* ProvenanceCorpusBlock */},     "by_category": {...} },
  "comparability":  { "overall": {/* ComparabilityCorpusBlock */},  "by_category": {...} }
}
```

---

## 10. Audience-mode wording cheatsheet

| Element | Research mode | Policy mode |
|---|---|---|
| Reproducibility gap badge | "Reproducibility gap" | "Setup not documented" |
| Reproducibility tooltip | "Setup not fully documented. Missing: {fields}." | "This score's setup isn't documented, so it can't be re-run as-is." |
| Reproducibility panel title | "Reproducibility" | "Re-runnability" |
| Completeness chip label | "Documentation" | "Documentation" |
| Completeness panel title | "Reporting completeness" | "How well is this benchmark documented?" |
| Provenance: first-party | "1st party" | "Reported by model developer" |
| Provenance: first-party only | "1st party only β€” no replication" | "Only the model developer reported this score" |
| Provenance: third-party | "3rd party" | "Independently reported" |
| Provenance: collaborative | "Collaborative" | "Joint report" |
| Variant divergence badge | "Variant divergence" | "Score depends on setup" |
| Variant divergence tooltip | "Scores diverge by {magnitude} across different setups: {fields}." | "Different runs of this evaluation produced different scores β€” the setup matters." |
| Cross-party divergence badge | "Cross-party divergence" | "Sources disagree" |
| Cross-party divergence tooltip | "Reports diverge by {magnitude} across organizations." | "Different organizations reported different scores for this same model on this same benchmark." |

Adjust tone but keep the underlying numbers identical across modes β€” the data is the same, only the framing changes.

---

*Last updated 2026-04-27. Maintainer: backend pipeline (eval_cards_backend_pipeline), frontend (general-eval-card). Questions on backend semantics β†’ [eval_cards_backend_pipeline#2](https://github.com/evaleval/eval_cards_backend_pipeline/issues/2). Questions on UX β†’ discuss with @anka-evals + frontend team.*