# Per-instance JSONL normalization
Drafted 2026-04-28. Migration item #7 in `notes/migration-plan.md`.
## Framing reminder
We are refactoring for UI efficiency. The TypeScript implementation as it stands today is the canonical spec for behaviour. This spec covers a per-record value transform that extracts UI-friendly strings (input, response, correctness, etc.) from the rich per-sample objects emitted by the pipeline. An audit against the full corpus on 2026-04-28 found that most of the parser's fallback branches never fire, because the pipeline emits a single canonical shape; the parser keeps them as defensive scaffolding for older or different harness output formats.
**Architecture choice (cleaning-only, no SQL involvement):** the inline `instance_examples` preview (≤5 samples per result) gets normalized in pipeline; the per-result `source_url` pointing to a full JSONL dump (typically 50 samples) stays as an on-demand UI fetch, NOT pre-ingested into a parquet table. This avoids speculative pipeline ingestion work and matches the actual product need (lazy-load all samples on user click, not cross-corpus sample querying). The orphaned `fetchInstanceLevelData` (`lib/hf-data.ts:890-917`) gets re-wired to a future "show all samples" UI feature rather than deleted.
## Rule (as TS implements it today)
### `parseInstanceLevelData` (`lib/hf-data.ts:933-1043`)
Pure function: takes a JSON object, walks `instance_examples[]`, returns `SampleResult[]`.
Top-level guard:
- If `data` is null/non-object → return `[]`
- If `data.instance_examples` is array → use it
- Else if `data` itself is array → use it
- Else → `[]`
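The guard order above can be sketched as follows. This is a minimal illustration, not the code in `lib/hf-data.ts`; the helper name `selectExamples` is hypothetical.

```typescript
// Hypothetical helper: picks the array of raw examples to walk,
// following the guard order listed above.
function selectExamples(data: unknown): unknown[] {
  // null or non-object input yields no examples
  if (data === null || typeof data !== "object") return [];
  // Preferred shape: { instance_examples: [...] }
  const examples = (data as { instance_examples?: unknown }).instance_examples;
  if (Array.isArray(examples)) return examples;
  // Fallback: the payload is itself the array of examples
  if (Array.isArray(data)) return data;
  return [];
}
```

Checking `instance_examples` before the bare-array case is harmless for arrays, since an array has no `instance_examples` property.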
Per-example field extraction (each is a fallback chain):
**`input` (string)** - first non-empty wins:
1. `raw.input` is string → use as-is
2. `raw.input.raw` is set → `String(raw.input.raw)`
3. `raw.prompt` → use as-is
4. `raw.question` → use as-is
5. `raw.doc.question` → use as-is
6. `raw.doc` exists → `JSON.stringify(raw.doc).slice(0, 500)`
7. (none) → empty string
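A minimal sketch of the `input` chain, assuming "non-empty" means a non-empty string and "is set" means non-null. The names `Raw`, `isObject` and `extractInput` are illustrative and may not match the real parser.

```typescript
type Raw = Record<string, unknown>;

// Illustrative type guard: any non-null object.
const isObject = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null;

function extractInput(raw: Raw): string {
  const input = raw.input;
  const prompt = raw.prompt;
  const question = raw.question;
  const doc = raw.doc;
  if (typeof input === "string" && input !== "") return input; // branch 1
  if (isObject(input) && input.raw != null) return String(input.raw); // branch 2
  if (typeof prompt === "string" && prompt !== "") return prompt; // branch 3
  if (typeof question === "string" && question !== "") return question; // branch 4
  if (isObject(doc)) {
    const docQuestion = doc.question;
    if (typeof docQuestion === "string" && docQuestion !== "") return docQuestion; // branch 5
  }
  // branch 6: last resort, truncated JSON dump of the doc
  if (doc != null) return JSON.stringify(doc).slice(0, 500);
  return ""; // branch 7
}
```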
**`ground_truth` (string | undefined)** - first non-null wins:
1. `raw.input.reference` is array → `array.join(", ")`; else → `String(...)`
2. `raw.ground_truth` → `String(...)`
3. `raw.target` → `String(...)`
4. `raw.gold` → `String(...)`
5. `raw.doc.answer` → `String(...)`
6. (none) → `undefined`
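The `ground_truth` chain could look roughly like this. It is a sketch under the assumption that "non-null" means `!= null`; `extractGroundTruth` is a hypothetical name.

```typescript
type Raw = Record<string, unknown>;

const isObject = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null;

function extractGroundTruth(raw: Raw): string | undefined {
  const input = raw.input;
  if (isObject(input) && input.reference != null) {
    const ref = input.reference;
    // Arrays are joined; anything else is coerced to a string.
    return Array.isArray(ref) ? ref.join(", ") : String(ref);
  }
  // Flat fallbacks, in priority order.
  for (const key of ["ground_truth", "target", "gold"]) {
    const value = raw[key];
    if (value != null) return String(value);
  }
  const doc = raw.doc;
  if (isObject(doc) && doc.answer != null) return String(doc.answer);
  return undefined;
}
```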
**`response` (string)** - first non-empty wins:
1. `raw.output` is set → string-as-is or `JSON.stringify(...)`
2. `raw.response` → use as-is
3. `raw.model_output` → use as-is
4. `raw.answer_attribution` is non-empty array → take last element's `extracted_value` (or empty string)
5. `raw.messages` is non-empty array → reverse + find last assistant message → string content or stringified
6. `raw.filtered_resps[0][0]` → use
7. `raw.resps[0][0]` → use
8. (none) → empty string
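One way to express the `response` chain is as an ordered candidate list. This sketch assumes that a branch producing an empty string falls through to the next one ("first non-empty wins"); the real parser may stop earlier in some branches. All helper names are illustrative.

```typescript
type Raw = Record<string, unknown>;

const isObject = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null;

// Strings pass through; other non-null values are JSON-stringified.
const asText = (v: unknown): string =>
  v == null ? "" : typeof v === "string" ? v : JSON.stringify(v);

// Reads v[0][0] when v is a nested array (the `resps` style layout).
const firstNested = (v: unknown): unknown =>
  Array.isArray(v) && Array.isArray(v[0]) ? v[0][0] : undefined;

function extractResponse(raw: Raw): string {
  const rawAttribution = raw.answer_attribution;
  const rawMessages = raw.messages;
  const attribution: unknown[] = Array.isArray(rawAttribution) ? rawAttribution : [];
  const messages: unknown[] = Array.isArray(rawMessages) ? rawMessages : [];
  const lastAttribution: unknown = attribution[attribution.length - 1];
  // Reverse a copy, then take the first assistant message, i.e. the last one.
  const lastAssistant: unknown = [...messages]
    .reverse()
    .find((m: unknown) => isObject(m) && m.role === "assistant");
  const candidates: unknown[] = [
    raw.output,
    raw.response,
    raw.model_output,
    isObject(lastAttribution) ? lastAttribution.extracted_value : undefined,
    isObject(lastAssistant) ? lastAssistant.content : undefined,
    firstNested(raw.filtered_resps),
    firstNested(raw.resps),
  ];
  for (const candidate of candidates) {
    const text = asText(candidate);
    if (text !== "") return text;
  }
  return "";
}
```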
**`is_correct` (boolean | undefined)** - first defined wins:
1. `raw.evaluation.is_correct` (boolean)
2. `raw.is_correct` (boolean)
3. `raw.metrics.exact_match === 1` → true; `=== 0` → false; else undefined
4. (none) → `undefined`
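A sketch of the `is_correct` chain, assuming only strict booleans count as "defined" for the first two branches. `extractIsCorrect` is a hypothetical name.

```typescript
type Raw = Record<string, unknown>;

const isObject = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null;

function extractIsCorrect(raw: Raw): boolean | undefined {
  const evaluation = raw.evaluation;
  if (isObject(evaluation)) {
    const nested = evaluation.is_correct;
    if (typeof nested === "boolean") return nested; // branch 1
  }
  const flat = raw.is_correct;
  if (typeof flat === "boolean") return flat; // branch 2
  const metrics = raw.metrics;
  if (isObject(metrics)) {
    // branch 3: only exact 1 / 0 map to a verdict; partial scores stay undefined
    if (metrics.exact_match === 1) return true;
    if (metrics.exact_match === 0) return false;
  }
  return undefined;
}
```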
**`metadata` (object | undefined)** - merged from (in order):
- `raw.evaluation` (if object)
- `raw.performance` (if object)
- `raw.metadata` (if object)
- `raw.metrics` (if object)
- If merged object is empty → `undefined`
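The merge can be sketched as below. Two assumptions are not stated in the rule itself: later sources overwrite earlier ones on key collisions, and arrays do not count as objects. `extractMetadata` is a hypothetical name.

```typescript
type Raw = Record<string, unknown>;

// Assumption: "if object" excludes null and arrays.
const isPlainObject = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null && !Array.isArray(v);

function extractMetadata(raw: Raw): Record<string, unknown> | undefined {
  const merged: Record<string, unknown> = {};
  // Merge order matters: later sources win on duplicate keys (assumed).
  for (const key of ["evaluation", "performance", "metadata", "metrics"]) {
    const part = raw[key];
    if (isPlainObject(part)) Object.assign(merged, part);
  }
  return Object.keys(merged).length > 0 ? merged : undefined;
}
```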
**`sample_id` (string)** - first non-null wins:
1. `raw.sample_id` → as-is
2. `raw.doc_id` → as-is
3. `raw.id` → as-is
4. (none) → `String(arrayIndex)` (positional fallback)
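A compact sketch of the `sample_id` chain. Note one deliberate difference: the rule above passes IDs through as-is, while this sketch coerces them with `String(...)` so the return type is always a string. `extractSampleId` is a hypothetical name.

```typescript
type Raw = Record<string, unknown>;

function extractSampleId(raw: Raw, arrayIndex: number): string {
  // `??` skips both null and undefined, matching "first non-null wins".
  const id = raw.sample_id ?? raw.doc_id ?? raw.id;
  // Positional fallback when no ID field is present.
  return id != null ? String(id) : String(arrayIndex);
}
```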
**`choices` (any | undefined)** - first non-null wins:
1. `raw.choices`
2. `raw.doc.choices`
3. (none) → `undefined`
If a row's first-pass map returns `null` (i.e. `raw` was null or non-object), it's filtered out via `.filter(s => s !== null)`.
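The map-then-filter pattern, together with the `choices` chain, might look like this. The `SampleSketch` type is a cut-down stand-in for `SampleResult`, and `mapExamples` is a hypothetical name. Because the index comes from the first-pass map, dropped rows do not shift the positional IDs of later rows.

```typescript
// Cut-down stand-in for SampleResult, for illustration only.
interface SampleSketch {
  sample_id: string;
  choices: unknown;
}

function mapExamples(examples: unknown[]): SampleSketch[] {
  return examples
    .map((raw, index): SampleSketch | null => {
      // Non-object rows become null and are dropped by the filter below.
      if (raw === null || typeof raw !== "object") return null;
      const row = raw as Record<string, unknown>;
      const doc = row.doc;
      const docChoices =
        typeof doc === "object" && doc !== null
          ? (doc as Record<string, unknown>).choices
          : undefined;
      return {
        sample_id: String(row.sample_id ?? index),
        choices: row.choices ?? docChoices ?? undefined,
      };
    })
    // Type predicate narrows (SampleSketch | null)[] to SampleSketch[].
    .filter((s): s is SampleSketch => s !== null);
}
```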
### `fetchInstanceLevelData` (`lib/hf-data.ts:890-917`): currently orphaned
Takes a `(url, limit?)`. Fetches the URL via `fetch()`. Splits text on newlines (filters empty lines). For each line up to `limit` (or all if no limit), tries `JSON.parse`; skips malformed lines. Wraps the parsed array as `{ instance_examples: parsed }` and passes to `parseInstanceLevelData`. Returns `SampleResult[]` or `[]` on any error.
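The line-splitting and parsing step can be sketched without the network call. `parseJsonlLines` is a hypothetical helper, not a function in `lib/hf-data.ts`; it assumes the limit applies to lines examined rather than lines successfully parsed, and that a falsy limit means "all lines".

```typescript
// Hypothetical helper: parses JSONL text, skipping blank and malformed lines.
function parseJsonlLines(text: string, limit?: number): unknown[] {
  const lines = text.split("\n").filter((line) => line.trim() !== "");
  // A limit of 0 or undefined means "no limit"; otherwise cap at the line count.
  const count = limit ? Math.min(limit, lines.length) : lines.length;
  const parsed: unknown[] = [];
  for (const line of lines.slice(0, count)) {
    try {
      parsed.push(JSON.parse(line));
    } catch {
      // Malformed line: skip it and keep going.
    }
  }
  return parsed;
}

// The fetcher would then hand the wrapped array to the parser, e.g.
// parseInstanceLevelData({ instance_examples: parseJsonlLines(body, limit) }).
```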
**Active call sites: zero.** Verified by grep across `app/`, `components/`, `scripts/`, other `lib/` files. Only mention is its own declaration. Git log shows one commit ("Refresh eval cards UI and backend data flow") in its history.
The function is intended for a "load more / show all samples" UI feature: the pipeline ships a `source_url` per row pointing to the full JSONL (typically 50 samples), but the inline `instance_examples` preview carries at most 5. The fetcher would let the UI request the full set on user demand. **This UI feature has never shipped.**
## Classification
- **Unconditional normalization.** The function always runs on whatever shape is provided; never gates on a pre-existing canonical field. Pipeline-side fix: emit a canonical per-sample shape so the multi-field fallback chains become unnecessary.
- **Cleaning → pipeline.** Pure value transform per sample. No aggregation, no reshape, no cross-record operations. Migration target: pipeline emits canonical per-sample shape on both the inline preview AND the URL JSONL files; TS parser shrinks to direct field reads or deletes entirely.
- **NOT reshape.** Per the architecture choice in the framing note above, samples stay as on-demand fetches via `source_url`; no parquet `instance_samples` table, no SQL queries over samples. (If a future product feature wants cross-model sample search/filter/comparison, that's a separate reshape spec.)
## Inputs and expected outputs
### Group A: Pipeline-canonical shape (the only shape that fires in production today)
Input shape (from cache `result.instance_level_data.instance_examples[i]`):
```jsonc
{
"schema_version": "...",
"evaluation_id": "...",
"model_id": "...",
"evaluation_name": "...",
"sample_id": "...",
"sample_hash": "...", // sometimes present
"interaction_type": "multi_turn",
"input": { "raw": "..." }, // ALWAYS object with .raw in production
"output": "..." | { ... }, // sometimes
"messages": [{ role: "...", content: "..." }, ...], // typically present
"answer_attribution": [..., { "extracted_value": "..." }], // typically present
"evaluation": { "is_correct": true|false, ... }, // ALWAYS present
"performance": { ... },
"metadata": { ... },
"token_usage": { ... },
"error": null | "...",
"hierarchy": [...]
}
```
Expected output (`SampleResult`):
| Output field | Source path that fires | Notes |
|---|---|---|
| `sample_id` | `raw.sample_id` (100% of production) | always present |
| `input` | `raw.input.raw` (100%) | branch #2 in the chain |
| `ground_truth` | `raw.input.reference` (100%) | branch #1 |
| `response` | `raw.answer_attribution` (97.31%) OR `raw.messages` (2.49%) OR `raw.output` (0.20%) | branches #4, #5, #1 |
| `is_correct` | `raw.evaluation.is_correct` (100%) | branch #1 |
| `choices` | `undefined` (100%) | no branch fires; field is unset in production |
| `metadata` | merged from `raw.evaluation`, `raw.performance`, `raw.metadata`, `raw.metrics` (always at least 2 of 4 present) | merged object |
### Group B: Defensive fallback branches (zero firing rate in current production)
| Branch | Output field | Production hits | Origin (presumed) |
|---|---|---|---|
| `raw.input` (string) | input | 0 | older harness shapes |
| `raw.prompt` | input | 0 | lm-eval-harness |
| `raw.question` | input | 0 | other harnesses |
| `raw.doc.question` | input | 0 | HELM-style |
| `raw.doc` (JSON.stringify) | input | 0 | last-resort |
| `raw.ground_truth` | ground_truth | 0 | older shapes |
| `raw.target` | ground_truth | 0 | classification benchmarks |
| `raw.gold` | ground_truth | 0 | older lm-eval |
| `raw.doc.answer` | ground_truth | 0 | HELM-style |
| `raw.response` | response | 0 | older shapes |
| `raw.model_output` | response | 0 | older shapes |
| `raw.filtered_resps[0][0]` | response | 0 | lm-eval-harness format |
| `raw.resps[0][0]` | response | 0 | lm-eval-harness format |
| `raw.is_correct` | is_correct | 0 | flat shape |
| `raw.metrics.exact_match` | is_correct | 0 | metric-based correctness |
| `raw.doc_id` | sample_id | 0 | HELM-style |
| `raw.id` | sample_id | 0 | generic |
| index fallback | sample_id | 0 | last-resort |
| `raw.choices` | choices | 0 | multiple-choice |
| `raw.doc.choices` | choices | 0 | HELM multiple-choice |
These branches exist for shapes the pipeline currently does not emit. **Preserve verbatim** until pipeline-side guarantees the canonical shape across all data sources.
### Group C: `fetchInstanceLevelData` JSONL parsing edge cases
| Input | Behavior |
|---|---|
| URL returns `!res.ok` (404, 500, etc.) | returns `[]` (no throw) |
| URL throws (network error) | logs warning to console, returns `[]` |
| Empty body | splits to `[]`, returns `[]` |
| Body with empty lines | `.filter(line => line.trim())` strips them |
| Body with malformed JSON line | swallowed in inner try-catch; line skipped, processing continues |
| `limit=0` or `limit=undefined` | parses ALL lines |
| `limit > lines.length` | parses all lines (capped via `Math.min`) |
## Current TS implementation
| Concern | Location | Notes |
|---|---|---|
| `parseInstanceLevelData` | `lib/hf-data.ts:933-1043` | The parser; ~110 lines |
| `fetchInstanceLevelData` | `lib/hf-data.ts:890-917` | URL fetcher; orphaned (zero callers) |
| Active call site | `lib/hf-data.ts:1273` (inside `flattenHierarchyNode`) | `parseInstanceLevelData(result.instance_level_data)` → `inlineSamples` |
| Internal call site | `lib/hf-data.ts:912` | inside `fetchInstanceLevelData` itself; delegates to the parser (not a recursive call) |
| `SampleResult` type | `lib/benchmark-schema.ts:135` | Output shape definition |
| Output field | `BenchmarkEvaluation.detailed_evaluation_results_per_samples` | `lib/benchmark-schema.ts:33` |
### Caller chain for `parseInstanceLevelData`
`getModelSummaryById` (lib/model-data.ts:1490+) → `flattenModelEvaluations` → `flattenHierarchyNode` (lib/hf-data.ts:1273) → `parseInstanceLevelData(result.instance_level_data)` → set as `inlineSamples` on each variant bucket → propagated to `BenchmarkEvaluation.detailed_evaluation_results_per_samples`.
UI consumers (read `data.detailed_evaluation_results_per_samples`):
- `components/benchmark-detail.tsx:3869`: random sample for preview block
- `components/benchmark-detail.tsx:4174-4216`: sample preview UI in benchmark detail
- `components/benchmark-detail.tsx:4982`: variant-level sample availability check
- `components/benchmark-detail.tsx:5284-5286`: variant sample picker
- `components/benchmark-detail.tsx:5569-5608`: sample preview list with INSTANCE_PREVIEW_LIMIT and "see all" expansion
### Caller chain for `fetchInstanceLevelData`
None. Function is exported and unreached. Preserved with the intent that a future "show all samples" UI consumer wires up to it.
## Pipeline status
### Side-by-side comparison
| Aspect | TS (this spec) | Pipeline today | Result for users |
|---|---|---|---|
| Inline preview shape | parser handles many variants | emits ONE canonical shape (`input.raw`, `evaluation.is_correct`, etc.) | parser's fallback branches almost all dead |
| URL JSONL shape | same parser handles | emits IDENTICAL canonical shape (verified by sampling one URL on 2026-04-28) | parser would work the same on URL data |
| Inline preview size | parser doesn't care | at most 5 samples per `instance_examples` array (3,532 examples across 712 rows, so some rows carry fewer than 5) | UI capped at 5 today |
| Total samples per row | n/a | typically 50 (per `instance_count` field), one outlier 18 | only ~10% of a typical row accessible to UI today |
| URL-fetch use case | `fetchInstanceLevelData` exists | `source_url` always emitted | dead code on TS side; no UI consumer |
### Concrete worked example with quantified scope
Audited 2026-04-28 against `.cache/hf-data/`. Verified by `scripts/verify-instance-level-data.mjs`.
**Prevalence:**
- Total model files: 5,830
- Files with any `instance_level_data`: **55 (0.94%)**
- Total `(metric × model_result)` rows: 86,183
- Result rows with `instance_level_data`: **712 (0.83%)**
- Total inline preview examples (sum of `instance_examples.length`): **3,532** (always ≤5 per row)
- Total full samples (sum of `instance_count`): **66,057** (full set available via `source_url`, not loaded today; ~19× larger than what UI currently shows)
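The percentages and ratios above follow directly from the raw counts. The snippet below simply recomputes them; the variable names are illustrative and not taken from `scripts/verify-instance-level-data.mjs`.

```typescript
// Raw counts from the 2026-04-28 audit, as listed above.
const audit = {
  modelFiles: 5830,
  filesWithIld: 55,
  resultRows: 86183,
  rowsWithIld: 712,
  inlineExamples: 3532,
  fullSamples: 66057,
};

const pct = (part: number, whole: number): string =>
  ((part / whole) * 100).toFixed(2) + "%";

const fileShare = pct(audit.filesWithIld, audit.modelFiles); // "0.94%"
const rowShare = pct(audit.rowsWithIld, audit.resultRows); // "0.83%"
// Full sample set relative to what the inline preview exposes: about 18.7x.
const fullToInline = (audit.fullSamples / audit.inlineExamples).toFixed(1); // "18.7"
// Mean preview size per row: about 4.96, i.e. slightly under the cap of 5.
const meanInline = (audit.inlineExamples / audit.rowsWithIld).toFixed(2); // "4.96"
```

A mean preview size just under 5 is consistent with the "≤5 per row" cap: a few rows carry fewer than 5 inline examples.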
**ild-level shape uniformity (712/712 rows):**
- Top-level keys are always exactly `{interaction_type, instance_count, source_url, instance_examples}`
- `interaction_type` is always `"multi_turn"` (no single_turn samples in cache)
**Per-example branch firing rates (3,532 examples):**
- `input`: `input.raw` 100%
- `ground_truth`: `input.reference` 100%
- `response`: `answer_attribution` 97.31%, `messages` 2.49%, `output` 0.20%
- `is_correct`: `evaluation.is_correct` 100%
- `sample_id`: `sample_id` 100%
- `choices`: nothing (always undefined)
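Counting every numbered entry in the rule lists above, including each chain's final default, the 7-entry input chain, 6-entry ground_truth chain, 4-entry is_correct chain, 4-entry sample_id chain, and 3-entry choices chain are **defensive scaffolding** for shapes the pipeline does not currently emit: in each of them at most one entry ever fires. The 8-entry response chain has 3 active branches.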
**URL JSONL shape verification:** sampled one source_url (`anthropic__anthropic-claude-3-7-sonnet/swe_bench_verified_mini_...`); first line had identical 18 keys to the inline `instance_examples[0]`, with `input.raw` and `evaluation.is_correct` in expected paths. Pipeline emits the same canonical shape on both inline and URL paths.
## Notes for pipeline implementer
- The pipeline already emits a canonical per-sample shape consistently. **No structural change needed to current emission.** The migration is to make this shape an explicit guarantee, not to change what's being emitted.
- Suggested guarantee: every `instance_examples[i]` (inline AND in JSONL at `source_url`) has at minimum `{sample_id, input.raw, input.reference?, evaluation.is_correct, answer_attribution? || messages?, metadata?}`.
- Once that guarantee is documented and verified, the TS parser shrinks dramatically: extract `raw.input.raw`, `raw.input.reference`, `raw.evaluation.is_correct`, `raw.sample_id` as direct field reads. The response field still needs a 3-branch fallback (`output` → `answer_attribution` → `messages`, keeping the parser's current priority order) until pipeline emits a single normalized `response` field.
- **Do NOT pre-ingest the URL JSONL into pipeline parquet** (per the architecture choice). The runtime UI fetches `source_url` on demand; this is the orphaned `fetchInstanceLevelData`'s intended use. The benefit of pre-ingestion (cross-corpus SQL queries over samples) is speculative; defer until a product feature demands it.
- The "shape uniformity" finding (712/712 rows have identical ild-level keys; all are `multi_turn`) suggests the pipeline already enforces the canonical shape. Worth documenting in the pipeline contract test (`tests/pipeline-contract.test.ts`).
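A contract check for the suggested guarantee could be as small as the predicate below. It is a sketch only: `meetsCanonicalShape` is a hypothetical name, and it treats "present" as non-null for `sample_id` and `input.raw` and as a strict boolean for `evaluation.is_correct`.

```typescript
const isObject = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null;

// Hypothetical predicate for the minimum required fields of one example.
function meetsCanonicalShape(example: unknown): boolean {
  if (!isObject(example)) return false;
  const input = example.input;
  const evaluation = example.evaluation;
  return (
    example.sample_id != null &&
    isObject(input) &&
    input.raw != null &&
    isObject(evaluation) &&
    typeof evaluation.is_correct === "boolean"
  );
}
```

A contract test could then assert `examples.every(meetsCanonicalShape)` over both the inline preview and a sampled `source_url` file.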
## Migration checklist
- [x] Spec written
- [x] Tests cover each rule branch (`tests/transformations/instance-level-data.test.ts`)
- [x] Audit script (`scripts/verify-instance-level-data.mjs`)
- [ ] Filed with pipeline owner with the spec + tests + audit script as acceptance criterion
- [ ] Pipeline contract: explicit guarantee of canonical per-sample shape (Tier A test asserting `every instance_example has input.raw, sample_id, evaluation.is_correct`)
- [ ] TS deleted: `parseInstanceLevelData` shrinks to direct field reads (~10 lines instead of 110), or fully deleted if pipeline emits already-flat normalized records. `fetchInstanceLevelData` stays orphaned-but-preserved for the future "show all samples" UI feature, OR is wired up if that feature ships.
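To make the "direct field reads" target concrete, here is one possible shape for the slimmed-down parser, assuming the canonical guarantee holds. The `CanonicalExample` and `SlimSample` types and the `parseCanonical` name are hypothetical. The response fallback keeps the existing chain's priority (`output`, then `answer_attribution`, then `messages`) so behaviour stays unchanged.

```typescript
// Hypothetical types describing the guaranteed per-sample shape.
interface CanonicalExample {
  sample_id: string;
  input: { raw: string; reference?: string | string[] };
  evaluation: { is_correct: boolean };
  output?: unknown;
  answer_attribution?: Array<{ extracted_value?: unknown }>;
  messages?: Array<{ role: string; content: unknown }>;
}

interface SlimSample {
  sample_id: string;
  input: string;
  ground_truth: string | undefined;
  response: string;
  is_correct: boolean;
}

const toText = (v: unknown): string =>
  v == null ? "" : typeof v === "string" ? v : JSON.stringify(v);

function parseCanonical(raw: CanonicalExample): SlimSample {
  const ref = raw.input.reference;
  const attribution = raw.answer_attribution ?? [];
  const lastAttribution = attribution[attribution.length - 1];
  const lastAssistant = [...(raw.messages ?? [])]
    .reverse()
    .find((m) => m.role === "assistant");
  return {
    sample_id: raw.sample_id,
    input: raw.input.raw,
    ground_truth: ref === undefined ? undefined : Array.isArray(ref) ? ref.join(", ") : ref,
    // Only remaining fallback: output, then answer_attribution, then messages.
    response:
      toText(raw.output) ||
      toText(lastAttribution?.extracted_value) ||
      toText(lastAssistant?.content),
    is_correct: raw.evaluation.is_correct,
  };
}
```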
## Future product decisions (deferred)
- **"Show all samples" UI feature**: would un-orphan `fetchInstanceLevelData` and let users see the full 50-sample set instead of just the 5-sample preview. Lazy fetch on user click. This is the concrete capability the URL-fetch architecture supports; the spec assumes it's a product roadmap item, not committed scope.
- **Cross-model sample querying / search / filter**: would require `instance_samples.parquet` and SQL queries (the alternative architecture I initially proposed). Out of scope for this spec; revisit if/when product asks.
- **Single-turn samples**: pipeline currently only emits `interaction_type: multi_turn`. If pipeline starts emitting single_turn shapes that exercise dormant parser branches (e.g. flat `input` strings, `prompt`/`question` fields), the spec's "100% canonical shape" claim breaks and the parser fallbacks become live again.