# Per-instance JSONL normalization

Drafted 2026-04-28. Migration item #7 in `notes/migration-plan.md`.

## Framing reminder

We are refactoring for UI efficiency. TS-as-is is the canonical spec for behaviour. This spec covers a per-record value transform that extracts UI-friendly strings (input/response/correctness/etc.) from rich per-sample objects emitted by the pipeline. An audit on 2026-04-28 against the full corpus found that the parser's many fallback branches mostly do not fire, because the pipeline emits a single canonical shape; the parser preserves them as defensive scaffolding for older or different harness output formats.

**Architecture choice (cleaning-only, no SQL involvement):** the inline `instance_examples` preview (≤5 samples per result) gets normalized in the pipeline; the per-result `source_url` pointing to a full JSONL dump (typically 50 samples) stays as an on-demand UI fetch, NOT pre-ingested into a parquet table. This avoids speculative pipeline ingestion work and matches the actual product need (lazy-load all samples on user click, not cross-corpus sample querying). The orphaned `fetchInstanceLevelData` (`lib/hf-data.ts:890-917`) gets re-wired to a future "show all samples" UI feature rather than deleted.

## Rule (as TS implements it today)

### `parseInstanceLevelData` (`lib/hf-data.ts:933-1043`)

Pure function: takes a JSON object, walks `instance_examples[]`, returns `SampleResult[]`.

Top-level guard:
- If `data` is null/non-object → return `[]`
- If `data.instance_examples` is array → use it
- Else if `data` itself is array → use it
- Else → `[]`
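
As a rough illustration, the guard above might be sketched as follows. `pickExamples` is a hypothetical name used only for this sketch; it is not the code in `lib/hf-data.ts`.

```typescript
// Hypothetical helper mirroring the top-level guard described above.
function pickExamples(data: unknown): unknown[] {
  // null and non-object inputs yield no examples.
  if (data === null || typeof data !== "object") return [];
  // Preferred shape: an object wrapping `instance_examples`.
  const wrapped = (data as { instance_examples?: unknown }).instance_examples;
  if (Array.isArray(wrapped)) return wrapped;
  // Fallback shape: the input itself is the array of examples.
  if (Array.isArray(data)) return data;
  return [];
}
```

Checking the wrapper before the bare array matches the order of the list; for a bare array the first check simply finds no `instance_examples` property and falls through.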

Per-example field extraction (each is a fallback chain):

**`input` (string)** - first non-empty wins:
1. `raw.input` is string → use as-is
2. `raw.input.raw` is set → `String(raw.input.raw)`
3. `raw.prompt` → use as-is
4. `raw.question` → use as-is
5. `raw.doc.question` → use as-is
6. `raw.doc` exists → `JSON.stringify(raw.doc).slice(0, 500)`
7. (none) → empty string
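
A minimal sketch of this chain, under two assumptions that the list does not settle: "non-empty" is read as "not `null`, `undefined`, or `""`", and the `prompt`/`question` candidates are only accepted when they are strings. `extractInput` and `asObject` are hypothetical names for illustration.

```typescript
type Raw = Record<string, unknown>;

// Narrow an unknown value to a plain object (arrays excluded), else undefined.
function asObject(v: unknown): Raw | undefined {
  return v !== null && typeof v === "object" && !Array.isArray(v) ? (v as Raw) : undefined;
}

function extractInput(raw: Raw): string {
  const input = raw.input;
  const doc = asObject(raw.doc);
  // 1. Flat string input.
  if (typeof input === "string" && input !== "") return input;
  // 2. Canonical shape: input.raw, coerced to string.
  const inner = asObject(input)?.raw;
  if (inner !== undefined && inner !== null && inner !== "") return String(inner);
  // 3-5. Alternative field names (assumed to be strings here).
  for (const candidate of [raw.prompt, raw.question, doc?.question]) {
    if (typeof candidate === "string" && candidate !== "") return candidate;
  }
  // 6. Last resort: a truncated JSON dump of the doc.
  if (doc) return JSON.stringify(doc).slice(0, 500);
  // 7. Nothing matched.
  return "";
}
```

For the canonical production shape, `extractInput({ input: { raw: "What is 2+2?" } })` returns `"What is 2+2?"` via step 2.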

**`ground_truth` (string | undefined)** - first non-null wins:
1. `raw.input.reference` is array → `array.join(", ")`; else → `String(...)`
2. `raw.ground_truth` → `String(...)`
3. `raw.target` → `String(...)`
4. `raw.gold` → `String(...)`
5. `raw.doc.answer` → `String(...)`
6. (none) → `undefined`
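
The distinctive step here is the array join in branch 1; the rest is a plain null-coalescing chain. A sketch, assuming "non-null" means neither `null` nor `undefined` (so `0` and `""` count as values); `extractGroundTruth` is a hypothetical name.

```typescript
type Raw = Record<string, unknown>;

function extractGroundTruth(raw: Raw): string | undefined {
  const input = raw.input;
  const doc = raw.doc;
  // 1. input.reference: arrays are joined, anything else is coerced.
  const reference =
    input !== null && typeof input === "object" ? (input as Raw).reference : undefined;
  if (reference !== undefined && reference !== null) {
    return Array.isArray(reference) ? reference.join(", ") : String(reference);
  }
  // 2-5. Flat fallbacks, each coerced with String(...).
  const docAnswer =
    doc !== null && typeof doc === "object" ? (doc as Raw).answer : undefined;
  for (const candidate of [raw.ground_truth, raw.target, raw.gold, docAnswer]) {
    if (candidate !== undefined && candidate !== null) return String(candidate);
  }
  // 6. Nothing matched.
  return undefined;
}
```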

**`response` (string)** - first non-empty wins:
1. `raw.output` is set → string-as-is or `JSON.stringify(...)`
2. `raw.response` → use as-is
3. `raw.model_output` → use as-is
4. `raw.answer_attribution` is non-empty array → take last element's `extracted_value` (or empty string)
5. `raw.messages` is non-empty array → scan from the end for the last assistant message → its content as a string, or stringified if not a string
6. `raw.filtered_resps[0][0]` → use
7. `raw.resps[0][0]` → use
8. (none) → empty string
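
The response chain is the only one with several live branches, so a sketch may help. It assumes that an empty `extracted_value` falls through to the next branch and that non-string values are JSON-encoded at every step; the helper names are hypothetical and the real parser may differ in these details.

```typescript
type Raw = Record<string, unknown>;

const isObject = (v: unknown): v is Raw => v !== null && typeof v === "object";

// `extracted_value` of the last attribution entry, if any.
function lastExtractedValue(attribution: unknown): unknown {
  if (!Array.isArray(attribution) || attribution.length === 0) return undefined;
  const last: unknown = attribution[attribution.length - 1];
  return isObject(last) ? last.extracted_value : undefined;
}

// Content of the most recent assistant message, scanning from the end.
function lastAssistantContent(messages: unknown): unknown {
  if (!Array.isArray(messages)) return undefined;
  for (let i = messages.length - 1; i >= 0; i--) {
    const message: unknown = messages[i];
    if (isObject(message) && message.role === "assistant") return message.content;
  }
  return undefined;
}

// First element of the first inner array, as in `resps[0][0]`.
function firstNested(value: unknown): unknown {
  return Array.isArray(value) && Array.isArray(value[0]) ? value[0][0] : undefined;
}

function extractResponse(raw: Raw): string {
  // Candidates in priority order (branches 1-7).
  const candidates: unknown[] = [
    raw.output,
    raw.response,
    raw.model_output,
    lastExtractedValue(raw.answer_attribution),
    lastAssistantContent(raw.messages),
    firstNested(raw.filtered_resps),
    firstNested(raw.resps),
  ];
  for (const value of candidates) {
    if (value === undefined || value === null || value === "") continue;
    return typeof value === "string" ? value : JSON.stringify(value);
  }
  // Branch 8: nothing matched.
  return "";
}
```

Because `raw.output` is checked first, a record carrying both `output` and `answer_attribution` resolves to `output` in this sketch.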

**`is_correct` (boolean | undefined)** - first defined wins:
1. `raw.evaluation.is_correct` (boolean)
2. `raw.is_correct` (boolean)
3. `raw.metrics.exact_match === 1` → true; `=== 0` → false; else undefined
4. (none) → `undefined`
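
A sketch of the correctness chain, assuming the first two branches only accept real booleans (as the "(boolean)" annotations suggest). `extractIsCorrect` is a hypothetical name.

```typescript
type Raw = Record<string, unknown>;

function extractIsCorrect(raw: Raw): boolean | undefined {
  // 1. Nested evaluation.is_correct.
  const evaluation = raw.evaluation;
  if (evaluation !== null && typeof evaluation === "object") {
    const nested = (evaluation as Raw).is_correct;
    if (typeof nested === "boolean") return nested;
  }
  // 2. Flat is_correct.
  const flat = raw.is_correct;
  if (typeof flat === "boolean") return flat;
  // 3. exact_match metric: only exactly 1 or 0 is decisive.
  const metrics = raw.metrics;
  if (metrics !== null && typeof metrics === "object") {
    const exactMatch = (metrics as Raw).exact_match;
    if (exactMatch === 1) return true;
    if (exactMatch === 0) return false;
  }
  // 4. Unknown correctness.
  return undefined;
}
```

A fractional `exact_match` such as `0.5` yields `undefined` rather than being rounded.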

**`metadata` (object | undefined)** - merged from (in order):
- `raw.evaluation` (if object)
- `raw.performance` (if object)
- `raw.metadata` (if object)
- `raw.metrics` (if object)
- If merged object is empty → `undefined`
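
The merge can be sketched as below. Two details are assumptions rather than facts from the source: arrays are not treated as objects, and on key collisions a later source overwrites an earlier one (the usual spread / `Object.assign` behaviour). `mergeMetadata` is a hypothetical name.

```typescript
type Raw = Record<string, unknown>;

function mergeMetadata(raw: Raw): Raw | undefined {
  const merged: Raw = {};
  // Sources in the documented order; later keys overwrite earlier ones.
  for (const key of ["evaluation", "performance", "metadata", "metrics"]) {
    const part = raw[key];
    if (part !== null && typeof part === "object" && !Array.isArray(part)) {
      Object.assign(merged, part);
    }
  }
  // An empty merge result is reported as undefined.
  return Object.keys(merged).length > 0 ? merged : undefined;
}
```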

**`sample_id` (string)** - first non-null wins:
1. `raw.sample_id` → as-is
2. `raw.doc_id` → as-is
3. `raw.id` → as-is
4. (none) → `String(arrayIndex)` (positional fallback)

**`choices` (any | undefined)** - first non-null wins:
1. `raw.choices`
2. `raw.doc.choices`
3. (none) → `undefined`

If a row's first-pass map returns `null` (i.e. `raw` was null or non-object), it's filtered out via `.filter(s => s !== null)`.
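
The `sample_id` and `choices` chains and the null filter fit together in one map-then-filter pass, sketched here. The positional fallback uses the index in the original array, before filtering. Coercing the id with `String(...)` is an assumption made so the sketch matches the declared `string` type; the list above says ids are used as-is. `extractIdAndChoices` is a hypothetical name.

```typescript
type Raw = Record<string, unknown>;

interface IdAndChoices {
  sample_id: string;
  choices?: unknown;
}

function extractIdAndChoices(examples: unknown[]): IdAndChoices[] {
  return examples
    .map((raw, index): IdAndChoices | null => {
      // Null or non-object rows map to null and are dropped below.
      if (raw === null || typeof raw !== "object") return null;
      const row = raw as Raw;
      const doc =
        row.doc !== null && typeof row.doc === "object" ? (row.doc as Raw) : undefined;
      // First non-null id wins; the array index is the last resort.
      const id = row.sample_id ?? row.doc_id ?? row.id ?? index;
      const choices = row.choices ?? doc?.choices;
      return { sample_id: String(id), choices };
    })
    .filter((s): s is IdAndChoices => s !== null);
}
```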

### `fetchInstanceLevelData` (`lib/hf-data.ts:890-917`): currently orphaned

Takes `(url, limit?)`. Fetches the URL via `fetch()`, splits the text on newlines, and drops empty lines. For each line up to `limit` (or all lines if no limit is given), it tries `JSON.parse` and skips malformed lines. It then wraps the parsed array as `{ instance_examples: parsed }` and passes it to `parseInstanceLevelData`. Returns `SampleResult[]`, or `[]` on any error.
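
The line-handling part of that description can be sketched as a pure function, leaving the `fetch()` call and error logging aside. `parseJsonlLines` is a hypothetical name, and the sketch assumes the limit counts non-empty lines rather than successfully parsed records.

```typescript
// Hypothetical pure helper: JSONL text in, parsed records out.
function parseJsonlLines(text: string, limit?: number): unknown[] {
  // Drop blank and whitespace-only lines.
  const lines = text.split("\n").filter((line) => line.trim() !== "");
  // A missing or zero limit means "all lines"; larger limits are capped.
  const count = limit ? Math.min(limit, lines.length) : lines.length;
  const parsed: unknown[] = [];
  for (const line of lines.slice(0, count)) {
    try {
      parsed.push(JSON.parse(line));
    } catch {
      // Malformed line: skip it and keep going.
    }
  }
  return parsed;
}
```

The result would then be wrapped as `{ instance_examples: parsed }` before being handed to the parser, as described above.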

**Active call sites: zero.** Verified by grep across `app/`, `components/`, `scripts/`, other `lib/` files. Only mention is its own declaration. Git log shows one commit ("Refresh eval cards UI and backend data flow") in its history.

The function is intended for a "load more / show all samples" UI feature: the pipeline ships a `source_url` per row pointing to the full JSONL (typically 50 samples), but the inline `instance_examples` preview carries at most 5. The fetcher would let the UI request the full set on user demand. **This UI feature has never shipped.**

## Classification

- **Unconditional normalization.** The function always runs on whatever shape is provided; never gates on a pre-existing canonical field. Pipeline-side fix: emit a canonical per-sample shape so the multi-field fallback chains become unnecessary.
- **Cleaning → pipeline.** Pure value transform per sample. No aggregation, no reshape, no cross-record operations. Migration target: pipeline emits canonical per-sample shape on both the inline preview AND the URL JSONL files; TS parser shrinks to direct field reads or deletes entirely.
- **NOT reshape.** Per the architecture choice in the framing note above, samples stay as on-demand fetches via `source_url`; no parquet `instance_samples` table, no SQL queries over samples. (If a future product feature wants cross-model sample search/filter/comparison, that's a separate reshape spec.)

## Inputs and expected outputs

### Group A: Pipeline-canonical shape (the only shape that fires in production today)

Input shape (from cache `result.instance_level_data.instance_examples[i]`):

```jsonc
{
  "schema_version": "...",
  "evaluation_id": "...",
  "model_id": "...",
  "evaluation_name": "...",
  "sample_id": "...",
  "sample_hash": "...",            // sometimes present
  "interaction_type": "multi_turn",
  "input": { "raw": "..." },        // ALWAYS object with .raw in production
  "output": "..." | { ... },        // sometimes
  "messages": [{ role: "...", content: "..." }, ...],  // typically present
  "answer_attribution": [..., { "extracted_value": "..." }],  // typically present
  "evaluation": { "is_correct": true|false, ... },  // ALWAYS present
  "performance": { ... },
  "metadata": { ... },
  "token_usage": { ... },
  "error": null | "...",
  "hierarchy": [...]
}
```

Expected output (`SampleResult`):

| Output field | Source path that fires | Notes |
|---|---|---|
| `sample_id` | `raw.sample_id` (100% of production) | always present |
| `input` | `raw.input.raw` (100%) | branch #2 in the chain |
| `ground_truth` | `raw.input.reference` (100%) | branch #1 |
| `response` | `raw.answer_attribution` (97.31%) OR `raw.messages` (2.49%) OR `raw.output` (0.20%) | branches #4, #5, #1 |
| `is_correct` | `raw.evaluation.is_correct` (100%) | branch #1 |
| `choices` | `undefined` (100%) | no branch fires; field is unset in production |
| `metadata` | merged from `raw.evaluation`, `raw.performance`, `raw.metadata`, `raw.metrics` (always at least 2 of 4 present) | merged object |

### Group B: Defensive fallback branches (zero firing rate in current production)

| Branch | Output field | Production hits | Origin (presumed) |
|---|---|---|---|
| `raw.input` (string) | input | 0 | older harness shapes |
| `raw.prompt` | input | 0 | lm-eval-harness |
| `raw.question` | input | 0 | other harnesses |
| `raw.doc.question` | input | 0 | HELM-style |
| `raw.doc` (JSON.stringify) | input | 0 | last-resort |
| `raw.ground_truth` | ground_truth | 0 | older shapes |
| `raw.target` | ground_truth | 0 | classification benchmarks |
| `raw.gold` | ground_truth | 0 | older lm-eval |
| `raw.doc.answer` | ground_truth | 0 | HELM-style |
| `raw.response` | response | 0 | older shapes |
| `raw.model_output` | response | 0 | older shapes |
| `raw.filtered_resps[0][0]` | response | 0 | lm-eval-harness format |
| `raw.resps[0][0]` | response | 0 | lm-eval-harness format |
| `raw.is_correct` | is_correct | 0 | flat shape |
| `raw.metrics.exact_match` | is_correct | 0 | metric-based correctness |
| `raw.doc_id` | sample_id | 0 | HELM-style |
| `raw.id` | sample_id | 0 | generic |
| index fallback | sample_id | 0 | last-resort |
| `raw.choices` | choices | 0 | multiple-choice |
| `raw.doc.choices` | choices | 0 | HELM multiple-choice |

These branches exist for shapes the pipeline currently does not emit. **Preserve verbatim** until pipeline-side guarantees the canonical shape across all data sources.

### Group C: `fetchInstanceLevelData` JSONL parsing edge cases

| Input | Behavior |
|---|---|
| URL returns `!res.ok` (404, 500, etc.) | returns `[]` (no throw) |
| URL throws (network error) | logs warning to console, returns `[]` |
| Empty body | splits to `[]`, returns `[]` |
| Body with empty lines | `.filter(line => line.trim())` strips them |
| Body with malformed JSON line | swallowed in inner try-catch; line skipped, processing continues |
| `limit=0` or `limit=undefined` | parses ALL lines |
| `limit > lines.length` | parses all lines (capped via `Math.min`) |

## Current TS implementation

| Concern | Location | Notes |
|---|---|---|
| `parseInstanceLevelData` | `lib/hf-data.ts:933-1043` | The parser; ~110 lines |
| `fetchInstanceLevelData` | `lib/hf-data.ts:890-917` | URL fetcher; orphaned (zero callers) |
| Active call site | `lib/hf-data.ts:1273` (inside `flattenHierarchyNode`) | `parseInstanceLevelData(result.instance_level_data)` → `inlineSamples` |
| Internal call site | `lib/hf-data.ts:912` | inside `fetchInstanceLevelData` itself; a plain (non-recursive) call that delegates to the parser |
| `SampleResult` type | `lib/benchmark-schema.ts:135` | Output shape definition |
| Output field | `BenchmarkEvaluation.detailed_evaluation_results_per_samples` | `lib/benchmark-schema.ts:33` |

### Caller chain for `parseInstanceLevelData`

`getModelSummaryById` (lib/model-data.ts:1490+) → `flattenModelEvaluations` → `flattenHierarchyNode` (lib/hf-data.ts:1273) → `parseInstanceLevelData(result.instance_level_data)` → set as `inlineSamples` on each variant bucket → propagated to `BenchmarkEvaluation.detailed_evaluation_results_per_samples`.

UI consumers (read `data.detailed_evaluation_results_per_samples`):
- `components/benchmark-detail.tsx:3869`: random sample for preview block
- `components/benchmark-detail.tsx:4174-4216`: sample preview UI in benchmark detail
- `components/benchmark-detail.tsx:4982`: variant-level sample availability check
- `components/benchmark-detail.tsx:5284-5286`: variant sample picker
- `components/benchmark-detail.tsx:5569-5608`: sample preview list with INSTANCE_PREVIEW_LIMIT and "see all" expansion

### Caller chain for `fetchInstanceLevelData`

None. Function is exported and unreached. Preserved with the intent that a future "show all samples" UI consumer wires up to it.

## Pipeline status

### Side-by-side comparison

| Aspect | TS (this spec) | Pipeline today | Result for users |
|---|---|---|---|
| Inline preview shape | parser handles many variants | emits ONE canonical shape (`input.raw`, `evaluation.is_correct`, etc.) | parser's fallback branches almost all dead |
| URL JSONL shape | same parser handles | emits IDENTICAL canonical shape (verified by sampling one URL on 2026-04-28) | parser would work the same on URL data |
| Inline preview size | parser doesn't care | at most 5 samples per `instance_examples` array (3,532 examples across 712 rows) | UI capped at 5 today |
| Total samples per row | n/a | typically 50 (per `instance_count` field), one outlier 18 | only 10% accessible to UI today |
| URL-fetch use case | `fetchInstanceLevelData` exists | `source_url` always emitted | dead code on TS side; no UI consumer |

### Concrete worked example with quantified scope

Audited 2026-04-28 against `.cache/hf-data/`. Verified by `scripts/verify-instance-level-data.mjs`.

**Prevalence:**
- Total model files: 5,830
- Files with any `instance_level_data`: **55 (0.94%)**
- Total `(metric × model_result)` rows: 86,183
- Result rows with `instance_level_data`: **712 (0.83%)**
- Total inline preview examples (sum of `instance_examples.length`): **3,532** (always ≤5 per row)
- Total full samples (sum of `instance_count`): **66,057** (full set available via `source_url`, not loaded today; ~19× larger than what UI currently shows)
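
The percentages and the ~19× ratio follow directly from the counts in this list; a quick arithmetic check (variable names are illustrative only):

```typescript
// Format a share as a percentage with two decimals.
const pct = (part: number, whole: number): string =>
  ((100 * part) / whole).toFixed(2) + "%";

const filesWithIld = pct(55, 5830);      // share of model files with instance_level_data
const rowsWithIld = pct(712, 86183);     // share of result rows with instance_level_data
const fullToPreviewRatio = 66057 / 3532; // full samples per inline preview example
const meanPreviewSize = 3532 / 712;      // average inline examples per row

console.log(filesWithIld, rowsWithIld, fullToPreviewRatio.toFixed(1), meanPreviewSize.toFixed(2));
// prints "0.94% 0.83% 18.7 4.96"
```

The mean preview size just under 5 is consistent with the "always ≤5 per row" note.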

**`instance_level_data`-level (ild-level) shape uniformity (712/712 rows):**
- Top-level keys are always exactly `{interaction_type, instance_count, source_url, instance_examples}`
- `interaction_type` is always `"multi_turn"` (no single_turn samples in cache)

**Per-example branch firing rates (3,532 examples):**
- `input`: `input.raw` 100%
- `ground_truth`: `input.reference` 100%
- `response`: `answer_attribution` 97.31%, `messages` 2.49%, `output` 0.20%
- `is_correct`: `evaluation.is_correct` 100%
- `sample_id`: `sample_id` 100%
- `choices`: nothing (always undefined)

The 6-branch input chain, 5-branch ground_truth chain, 3-branch is_correct chain, 4-branch sample_id chain, and 2-branch choices chain (counts exclude the terminal empty-string or `undefined` default) are **defensive scaffolding** for shapes the pipeline does not currently emit: at most one branch of each fires in production. The 7-branch response chain has 3 active branches.

**URL JSONL shape verification:** sampled one source_url (`anthropic__anthropic-claude-3-7-sonnet/swe_bench_verified_mini_...`); first line had identical 18 keys to the inline `instance_examples[0]`, with `input.raw` and `evaluation.is_correct` in expected paths. Pipeline emits the same canonical shape on both inline and URL paths.

## Notes for pipeline implementer

- The pipeline already emits a canonical per-sample shape consistently. **No structural change needed to current emission.** The migration is to make this shape an explicit guarantee, not to change what's being emitted.
- Suggested guarantee: every `instance_examples[i]` (inline AND in JSONL at `source_url`) has at minimum `{sample_id, input.raw, input.reference?, evaluation.is_correct, answer_attribution? || messages?, metadata?}`.
- Once that guarantee is documented and verified, the TS parser shrinks dramatically: extract `raw.input.raw`, `raw.input.reference`, `raw.evaluation.is_correct`, `raw.sample_id` as direct field reads. The response field still needs the 3-branch fallback (answer_attribution → messages → output) until pipeline emits a single normalized `response` field.
- **Do NOT pre-ingest the URL JSONL into pipeline parquet** (per the architecture choice). The runtime UI fetches `source_url` on demand; this is the orphaned `fetchInstanceLevelData`'s intended use. The benefit of pre-ingestion (cross-corpus SQL queries over samples) is speculative; defer until a product feature demands it.
- The "shape uniformity" finding (712/712 rows have identical ild-level keys; all are `multi_turn`) suggests the pipeline already enforces the canonical shape. Worth documenting in the pipeline contract test (`tests/pipeline-contract.test.ts`).
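
One possible shape for the suggested guarantee, written as a validator that a contract test could call on every example. This is a sketch only: `canonicalShapeErrors` is a hypothetical name, and accepting `output` as a response source is an assumption based on the 0.20% of examples that resolve through it.

```typescript
type Raw = Record<string, unknown>;

const isObject = (v: unknown): v is Raw =>
  v !== null && typeof v === "object" && !Array.isArray(v);

const nonEmptyArray = (v: unknown): boolean => Array.isArray(v) && v.length > 0;

// Returns a list of violations; an empty list means the example is canonical.
function canonicalShapeErrors(example: unknown): string[] {
  if (!isObject(example)) return ["example is not an object"];
  const errors: string[] = [];
  if (example.sample_id === undefined || example.sample_id === null) {
    errors.push("missing sample_id");
  }
  const input = example.input;
  if (!isObject(input) || input.raw === undefined || input.raw === null) {
    errors.push("missing input.raw");
  }
  const evaluation = example.evaluation;
  if (!isObject(evaluation) || typeof evaluation.is_correct !== "boolean") {
    errors.push("evaluation.is_correct is not a boolean");
  }
  // At least one response source must be present (output accepted by assumption).
  const hasResponseSource =
    nonEmptyArray(example.answer_attribution) ||
    nonEmptyArray(example.messages) ||
    (example.output !== undefined && example.output !== null);
  if (!hasResponseSource) errors.push("no response source");
  return errors;
}
```

Returning every violation instead of failing on the first one would let a contract test report all problems with a sample in a single run.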

## Migration checklist

- [x] Spec written
- [x] Tests cover each rule branch (`tests/transformations/instance-level-data.test.ts`)
- [x] Audit script (`scripts/verify-instance-level-data.mjs`)
- [ ] Filed with pipeline owner with the spec + tests + audit script as acceptance criterion
- [ ] Pipeline contract: explicit guarantee of canonical per-sample shape (Tier A test asserting `every instance_example has input.raw, sample_id, evaluation.is_correct`)
- [ ] TS deleted: `parseInstanceLevelData` shrinks to direct field reads (~10 lines instead of 110), or fully deleted if pipeline emits already-flat normalized records. `fetchInstanceLevelData` stays orphaned-but-preserved for the future "show all samples" UI feature, OR is wired up if that feature ships.

## Future product decisions (deferred)

- **"Show all samples" UI feature**: would un-orphan `fetchInstanceLevelData` and let users see the full 50-sample set instead of just the 5-sample preview. Lazy fetch on user click. This is the concrete capability the URL-fetch architecture supports; the spec assumes it's a product roadmap item, not committed scope.
- **Cross-model sample querying / search / filter**: would require `instance_samples.parquet` and SQL queries (the alternative architecture I initially proposed). Out of scope for this spec; revisit if/when product asks.
- **Single-turn samples**: pipeline currently only emits `interaction_type: multi_turn`. If pipeline starts emitting single_turn shapes that exercise dormant parser branches (e.g. flat `input` strings, `prompt`/`question` fields), the spec's "100% canonical shape" claim breaks and the parser fallbacks become live again.