# Per-instance JSONL normalization
Drafted 2026-04-28. Migration item #7 in `notes/migration-plan.md`.
## Framing reminder
We are refactoring for UI efficiency. TS-as-is is the canonical spec for behaviour. This spec covers a per-record value transform that extracts UI-friendly strings (input/response/correctness/etc.) from rich per-sample objects emitted by the pipeline. An audit against the full corpus on 2026-04-28 found that most of the parser's fallback branches never fire, because the pipeline emits a single canonical shape. The parser nonetheless preserves them as defensive scaffolding for older or different harness output formats.
**Architecture choice (cleaning-only, no SQL involvement):** the inline `instance_examples` preview (≤5 samples per result) gets normalized in the pipeline; the per-result `source_url` pointing to a full JSONL dump (typically 50 samples) stays as an on-demand UI fetch, NOT pre-ingested into a parquet table. This avoids speculative pipeline ingestion work and matches the actual product need (lazy-load all samples on user click, not cross-corpus sample querying). The orphaned `fetchInstanceLevelData` (`lib/hf-data.ts:890-917`) gets re-wired to a future "show all samples" UI feature rather than deleted.
## Rule (as TS implements it today)
### `parseInstanceLevelData` (`lib/hf-data.ts:933-1043`)
Pure function: takes a JSON object, walks `instance_examples[]`, returns `SampleResult[]`.
Top-level guard:
- If `data` is null/non-object → return `[]`
- If `data.instance_examples` is array → use it
- Else if `data` itself is array → use it
- Else → `[]`
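The guard order above can be sketched as a small standalone function. The name `selectExamples` is hypothetical and does not appear in `lib/hf-data.ts`; only the branch order is taken from the list.

```typescript
// Hypothetical helper: returns the array of raw examples to walk,
// following the guard order listed above.
function selectExamples(data: unknown): unknown[] {
  // null / non-object input yields no examples
  if (data === null || typeof data !== "object") return [];
  // preferred: the wrapped form { instance_examples: [...] }
  const examples = (data as { instance_examples?: unknown }).instance_examples;
  if (Array.isArray(examples)) return examples;
  // fallback: the input is itself the array of examples
  if (Array.isArray(data)) return data;
  return [];
}
```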
Per-example field extraction (each is a fallback chain):
**`input` (string)** - first non-empty wins:
1. `raw.input` is string → use as-is
2. `raw.input.raw` is set → `String(raw.input.raw)`
3. `raw.prompt` → use as-is
4. `raw.question` → use as-is
5. `raw.doc.question` → use as-is
6. `raw.doc` exists → `JSON.stringify(raw.doc).slice(0, 500)`
7. (none) → empty string
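A minimal sketch of the `input` chain follows. It is not the code in `lib/hf-data.ts`: the names `extractInput` and `nonEmpty` are hypothetical, and it assumes that "non-empty" means a string of length greater than zero and that "set" means non-null.

```typescript
type Raw = Record<string, unknown>;

const isRecord = (v: unknown): v is Raw =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Assumption: "non-empty" means a string with length > 0.
const nonEmpty = (v: unknown): string | undefined =>
  typeof v === "string" && v.length > 0 ? v : undefined;

// Hypothetical name; branch numbers refer to the list above.
function extractInput(raw: Raw): string {
  const input = raw.input;
  const doc = raw.doc;
  return (
    nonEmpty(input) ?? // 1. plain string input
    (isRecord(input) && input.raw != null
      ? nonEmpty(String(input.raw)) // 2. input.raw (the production path)
      : undefined) ??
    nonEmpty(raw.prompt) ?? // 3
    nonEmpty(raw.question) ?? // 4
    (isRecord(doc) ? nonEmpty(doc.question) : undefined) ?? // 5
    (doc != null ? JSON.stringify(doc).slice(0, 500) : undefined) ?? // 6
    "" // 7. nothing matched
  );
}
```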
**`ground_truth` (string | undefined)** - first non-null wins:
1. `raw.input.reference` is array → `array.join(", ")`; else → `String(...)`
2. `raw.ground_truth` → `String(...)`
3. `raw.target` → `String(...)`
4. `raw.gold` → `String(...)`
5. `raw.doc.answer` → `String(...)`
6. (none) → `undefined`
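The `ground_truth` chain can be sketched the same way. `extractGroundTruth` is a hypothetical name, and "non-null" is read here as neither `null` nor `undefined`.

```typescript
type Raw = Record<string, unknown>;

const isRecord = (v: unknown): v is Raw =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Hypothetical name; branch numbers refer to the list above.
function extractGroundTruth(raw: Raw): string | undefined {
  const input = raw.input;
  const doc = raw.doc;
  // 1. input.reference: arrays are joined, anything else is stringified
  if (isRecord(input) && input.reference != null) {
    const ref = input.reference;
    return Array.isArray(ref) ? ref.join(", ") : String(ref);
  }
  // 2-4. flat fields, in order
  for (const key of ["ground_truth", "target", "gold"]) {
    if (raw[key] != null) return String(raw[key]);
  }
  // 5. nested doc.answer
  if (isRecord(doc) && doc.answer != null) return String(doc.answer);
  // 6. nothing matched
  return undefined;
}
```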
**`response` (string)** - first non-empty wins:
1. `raw.output` is set → string-as-is or `JSON.stringify(...)`
2. `raw.response` → use as-is
3. `raw.model_output` → use as-is
4. `raw.answer_attribution` is non-empty array → take last element's `extracted_value` (or empty string)
5. `raw.messages` is non-empty array → reverse + find last assistant message → string content or stringified
6. `raw.filtered_resps[0][0]` → use
7. `raw.resps[0][0]` → use
8. (none) → empty string
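As a sketch, the `response` chain can be modelled as an ordered list of candidate getters where the first non-empty string wins. The names `extractResponse`, `toText`, and `firstOfFirst` are hypothetical. Treating only non-empty strings as winners is an assumption, so a non-string `resps[0][0]` is skipped here even though the real parser may handle it differently.

```typescript
type Raw = Record<string, unknown>;

const isRecord = (v: unknown): v is Raw =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Strings pass through; other non-null values are JSON-stringified.
const toText = (v: unknown): unknown =>
  v == null ? undefined : typeof v === "string" ? v : JSON.stringify(v);

// Reads value[0][0] from a nested array, if present.
const firstOfFirst = (v: unknown): unknown =>
  Array.isArray(v) && Array.isArray(v[0]) ? v[0][0] : undefined;

// Hypothetical name; candidates are listed in the branch order above.
function extractResponse(raw: Raw): string {
  const attribution = raw.answer_attribution;
  const messages = raw.messages;
  const candidates: Array<() => unknown> = [
    () => toText(raw.output), // 1
    () => raw.response, // 2
    () => raw.model_output, // 3
    () => {
      // 4. last attribution entry's extracted_value
      if (!Array.isArray(attribution) || attribution.length === 0) return undefined;
      const last: unknown = attribution[attribution.length - 1];
      return isRecord(last) ? last.extracted_value : undefined;
    },
    () => {
      // 5. last assistant message: reverse a copy, take the first match
      if (!Array.isArray(messages) || messages.length === 0) return undefined;
      const lastAssistant: unknown = [...messages]
        .reverse()
        .find((m: unknown) => isRecord(m) && m.role === "assistant");
      return isRecord(lastAssistant) ? toText(lastAssistant.content) : undefined;
    },
    () => firstOfFirst(raw.filtered_resps), // 6
    () => firstOfFirst(raw.resps), // 7
  ];
  for (const candidate of candidates) {
    const value = candidate();
    if (typeof value === "string" && value.length > 0) return value;
  }
  return ""; // 8. nothing matched
}
```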
**`is_correct` (boolean | undefined)** - first defined wins:
1. `raw.evaluation.is_correct` (boolean)
2. `raw.is_correct` (boolean)
3. `raw.metrics.exact_match === 1` → true; `=== 0` → false; else undefined
4. (none) → `undefined`
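A sketch of the `is_correct` chain, with the hypothetical name `extractIsCorrect`. It assumes that "defined" means the value is an actual boolean, which matches the `(boolean)` annotations above but is not confirmed against the source.

```typescript
type Raw = Record<string, unknown>;

const isRecord = (v: unknown): v is Raw =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Hypothetical name; branch numbers refer to the list above.
function extractIsCorrect(raw: Raw): boolean | undefined {
  const evaluation = raw.evaluation;
  const metrics = raw.metrics;
  // 1. nested evaluation.is_correct (the production path)
  if (isRecord(evaluation)) {
    const nested = evaluation.is_correct;
    if (typeof nested === "boolean") return nested;
  }
  // 2. flat is_correct
  const flat = raw.is_correct;
  if (typeof flat === "boolean") return flat;
  // 3. metric-based: exactly 1 is correct, exactly 0 is incorrect
  if (isRecord(metrics)) {
    if (metrics.exact_match === 1) return true;
    if (metrics.exact_match === 0) return false;
  }
  // 4. unknown
  return undefined;
}
```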
**`metadata` (object | undefined)** - merged from (in order):
- `raw.evaluation` (if object)
- `raw.performance` (if object)
- `raw.metadata` (if object)
- `raw.metrics` (if object)
- If merged object is empty → `undefined`
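The merge can be sketched as below. `mergeMetadata` is a hypothetical name. Two details are assumptions rather than facts from the source: later sources overwrite earlier ones on key collisions, and arrays are not treated as objects.

```typescript
type Raw = Record<string, unknown>;

const isRecord = (v: unknown): v is Raw =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Hypothetical name; sources are merged in the order listed above.
function mergeMetadata(raw: Raw): Raw | undefined {
  const merged: Raw = {};
  for (const key of ["evaluation", "performance", "metadata", "metrics"]) {
    const part = raw[key];
    // Assumption: on key collisions the later source wins.
    if (isRecord(part)) Object.assign(merged, part);
  }
  // An empty merge result is reported as undefined, not {}.
  return Object.keys(merged).length > 0 ? merged : undefined;
}
```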
**`sample_id` (string)** - first non-null wins:
1. `raw.sample_id` → as-is
2. `raw.doc_id` → as-is
3. `raw.id` → as-is
4. (none) → `String(arrayIndex)` (positional fallback)
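A sketch of the `sample_id` chain using nullish coalescing. `extractSampleId` is a hypothetical name. The list says the first three branches are used as-is; this sketch coerces them with `String(...)` so the return type is always `string`, which is an assumption about non-string ids.

```typescript
type Raw = Record<string, unknown>;

// Hypothetical name; `index` is the example's position in the array.
function extractSampleId(raw: Raw, index: number): string {
  // ?? skips both null and undefined, matching "first non-null wins"
  const id = raw.sample_id ?? raw.doc_id ?? raw.id;
  return id != null ? String(id) : String(index);
}
```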
**`choices` (any | undefined)** - first non-null wins:
1. `raw.choices`
2. `raw.doc.choices`
3. (none) → `undefined`
If a row's first-pass map returns `null` (i.e. `raw` was null or non-object), it's filtered out via `.filter(s => s !== null)`.
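The `choices` chain and the null filter fit together as a map-then-filter pass. The sketch below uses a reduced, hypothetical output type (`SampleSketch`, not the real `SampleResult`) to show how the positional index comes from the map step, before null rows are dropped.

```typescript
type Raw = Record<string, unknown>;

const isRecord = (v: unknown): v is Raw =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Hypothetical, reduced output shape for illustration only.
interface SampleSketch {
  sample_id: string;
  choices?: unknown;
}

function mapExamples(examples: unknown[]): SampleSketch[] {
  return examples
    .map((raw, index): SampleSketch | null => {
      // null or non-object rows are marked for removal
      if (!isRecord(raw)) return null;
      const doc = raw.doc;
      const choices =
        raw.choices ?? (isRecord(doc) ? doc.choices : undefined) ?? undefined;
      // the index is taken before filtering, so it reflects the original position
      return { sample_id: String(raw.sample_id ?? index), choices };
    })
    .filter((s): s is SampleSketch => s !== null);
}
```

The type predicate on `.filter` is what lets TypeScript narrow the result from `(SampleSketch | null)[]` to `SampleSketch[]`.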
### `fetchInstanceLevelData` (`lib/hf-data.ts:890-917`), currently orphaned
Takes a `(url, limit?)`. Fetches the URL via `fetch()`. Splits text on newlines (filters empty lines). For each line up to `limit` (or all if no limit), tries `JSON.parse`; skips malformed lines. Wraps the parsed array as `{ instance_examples: parsed }` and passes to `parseInstanceLevelData`. Returns `SampleResult[]` or `[]` on any error.
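The line-splitting and parsing step described above can be sketched as a pure function, leaving out the network call. `parseJsonlLines` is a hypothetical name. Treating `limit=0` the same as "no limit" follows the edge-case behaviour documented for this function, and the sketch assumes the limit counts lines attempted rather than lines successfully parsed.

```typescript
// Hypothetical helper: parses up to `limit` non-empty lines of JSONL text.
function parseJsonlLines(text: string, limit?: number): unknown[] {
  // drop empty and whitespace-only lines
  const lines = text.split("\n").filter((line) => line.trim());
  // a falsy limit (0 or undefined) means "all lines"
  const count = limit ? Math.min(limit, lines.length) : lines.length;
  const parsed: unknown[] = [];
  for (const line of lines.slice(0, count)) {
    try {
      parsed.push(JSON.parse(line));
    } catch {
      // malformed line: skip it and keep going
    }
  }
  return parsed;
}
```

The result would then be wrapped as `{ instance_examples: parsed }` before being handed to the parser.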
**Active call sites: zero.** Verified by grep across `app/`, `components/`, `scripts/`, other `lib/` files. Only mention is its own declaration. Git log shows one commit ("Refresh eval cards UI and backend data flow") in its history.
The function is intended for a "load more / show all samples" UI feature: the pipeline ships a `source_url` per row pointing to the full JSONL (typically 50 samples), but the inline `instance_examples` preview carries at most 5. The fetcher would let the UI request the full set on user demand. **This UI feature has never shipped.**
## Classification
- **Unconditional normalization.** The function always runs on whatever shape is provided; never gates on a pre-existing canonical field. Pipeline-side fix: emit a canonical per-sample shape so the multi-field fallback chains become unnecessary.
- **Cleaning → pipeline.** Pure value transform per sample. No aggregation, no reshape, no cross-record operations. Migration target: pipeline emits canonical per-sample shape on both the inline preview AND the URL JSONL files; TS parser shrinks to direct field reads or is deleted entirely.
- **NOT reshape.** Per the architecture choice in the framing note above, samples stay as on-demand fetches via `source_url`; no parquet `instance_samples` table, no SQL queries over samples. (If a future product feature wants cross-model sample search/filter/comparison, that's a separate reshape spec.)
## Inputs and expected outputs
### Group A: Pipeline-canonical shape (the only shape that fires in production today)
Input shape (from cache `result.instance_level_data.instance_examples[i]`):
```jsonc
{
"schema_version": "...",
"evaluation_id": "...",
"model_id": "...",
"evaluation_name": "...",
"sample_id": "...",
"sample_hash": "...", // sometimes present
"interaction_type": "multi_turn",
"input": { "raw": "..." }, // ALWAYS object with .raw in production
"output": "..." | { ... }, // sometimes
"messages": [{ role: "...", content: "..." }, ...], // typically present
"answer_attribution": [..., { "extracted_value": "..." }], // typically present
"evaluation": { "is_correct": true|false, ... }, // ALWAYS present
"performance": { ... },
"metadata": { ... },
"token_usage": { ... },
"error": null | "...",
"hierarchy": [...]
}
```
Expected output (`SampleResult`):
| Output field | Source path that fires | Notes |
|---|---|---|
| `sample_id` | `raw.sample_id` (100% of production) | always present |
| `input` | `raw.input.raw` (100%) | branch #2 in the chain |
| `ground_truth` | `raw.input.reference` (100%) | branch #1 |
| `response` | `raw.answer_attribution` (97.31%) OR `raw.messages` (2.49%) OR `raw.output` (0.20%) | branches #4, #5, #1 |
| `is_correct` | `raw.evaluation.is_correct` (100%) | branch #1 |
| `choices` | `undefined` (100%) | no branch fires; field is unset in production |
| `metadata` | merged from `raw.evaluation`, `raw.performance`, `raw.metadata`, `raw.metrics` (always at least 2 of 4 present) | merged object |
### Group B: Defensive fallback branches (zero firing rate in current production)
| Branch | Output field | Production hits | Origin (presumed) |
|---|---|---|---|
| `raw.input` (string) | input | 0 | older harness shapes |
| `raw.prompt` | input | 0 | lm-eval-harness |
| `raw.question` | input | 0 | other harnesses |
| `raw.doc.question` | input | 0 | HELM-style |
| `raw.doc` (JSON.stringify) | input | 0 | last-resort |
| `raw.ground_truth` | ground_truth | 0 | older shapes |
| `raw.target` | ground_truth | 0 | classification benchmarks |
| `raw.gold` | ground_truth | 0 | older lm-eval |
| `raw.doc.answer` | ground_truth | 0 | HELM-style |
| `raw.response` | response | 0 | older shapes |
| `raw.model_output` | response | 0 | older shapes |
| `raw.filtered_resps[0][0]` | response | 0 | lm-eval-harness format |
| `raw.resps[0][0]` | response | 0 | lm-eval-harness format |
| `raw.is_correct` | is_correct | 0 | flat shape |
| `raw.metrics.exact_match` | is_correct | 0 | metric-based correctness |
| `raw.doc_id` | sample_id | 0 | HELM-style |
| `raw.id` | sample_id | 0 | generic |
| index fallback | sample_id | 0 | last-resort |
| `raw.choices` | choices | 0 | multiple-choice |
| `raw.doc.choices` | choices | 0 | HELM multiple-choice |
These branches exist for shapes the pipeline currently does not emit. **Preserve verbatim** until pipeline-side guarantees the canonical shape across all data sources.
### Group C: `fetchInstanceLevelData` JSONL parsing edge cases
| Input | Behavior |
|---|---|
| URL returns `!res.ok` (404, 500, etc.) | returns `[]` (no throw) |
| URL throws (network error) | logs warning to console, returns `[]` |
| Empty body | splits to `[]`, returns `[]` |
| Body with empty lines | `.filter(line => line.trim())` strips them |
| Body with malformed JSON line | swallowed in inner try-catch; line skipped, processing continues |
| `limit=0` or `limit=undefined` | parses ALL lines |
| `limit > lines.length` | parses all lines (capped via `Math.min`) |
## Current TS implementation
| Concern | Location | Notes |
|---|---|---|
| `parseInstanceLevelData` | `lib/hf-data.ts:933-1043` | The parser; ~110 lines |
| `fetchInstanceLevelData` | `lib/hf-data.ts:890-917` | URL fetcher; orphaned (zero callers) |
| Active call site | `lib/hf-data.ts:1273` (inside `flattenHierarchyNode`) | `parseInstanceLevelData(result.instance_level_data)` → `inlineSamples` |
| Internal call site | `lib/hf-data.ts:912` | inside `fetchInstanceLevelData`, which delegates to the parser |
| `SampleResult` type | `lib/benchmark-schema.ts:135` | Output shape definition |
| Output field | `BenchmarkEvaluation.detailed_evaluation_results_per_samples` | `lib/benchmark-schema.ts:33` |
### Caller chain for `parseInstanceLevelData`
`getModelSummaryById` (`lib/model-data.ts:1490+`) → `flattenModelEvaluations` → `flattenHierarchyNode` (`lib/hf-data.ts:1273`) → `parseInstanceLevelData(result.instance_level_data)` → set as `inlineSamples` on each variant bucket → propagated to `BenchmarkEvaluation.detailed_evaluation_results_per_samples`.
UI consumers (read `data.detailed_evaluation_results_per_samples`):
- `components/benchmark-detail.tsx:3869`: random sample for preview block
- `components/benchmark-detail.tsx:4174-4216`: sample preview UI in benchmark detail
- `components/benchmark-detail.tsx:4982`: variant-level sample availability check
- `components/benchmark-detail.tsx:5284-5286`: variant sample picker
- `components/benchmark-detail.tsx:5569-5608`: sample preview list with INSTANCE_PREVIEW_LIMIT and "see all" expansion
### Caller chain for `fetchInstanceLevelData`
None. Function is exported and unreached. Preserved with the intent that a future "show all samples" UI consumer wires up to it.
## Pipeline status
### Side-by-side comparison
| Aspect | TS (this spec) | Pipeline today | Result for users |
|---|---|---|---|
| Inline preview shape | parser handles many variants | emits ONE canonical shape (`input.raw`, `evaluation.is_correct`, etc.) | parser's fallback branches almost all dead |
| URL JSONL shape | same parser handles | emits IDENTICAL canonical shape (verified by sampling one URL on 2026-04-28) | parser would work the same on URL data |
| Inline preview size | parser doesn't care | at most 5 samples per `instance_examples` array (3,532 examples across 712 rows) | UI capped at 5 today |
| Total samples per row | n/a | typically 50 (per `instance_count` field), one outlier 18 | only 10% accessible to UI today |
| URL-fetch use case | `fetchInstanceLevelData` exists | `source_url` always emitted | dead code on TS side; no UI consumer |
### Concrete worked example with quantified scope
Audited 2026-04-28 against `.cache/hf-data/`. Verified by `scripts/verify-instance-level-data.mjs`.
**Prevalence:**
- Total model files: 5,830
- Files with any `instance_level_data`: **55 (0.94%)**
- Total `(metric × model_result)` rows: 86,183
- Result rows with `instance_level_data`: **712 (0.83%)**
- Total inline preview examples (sum of `instance_examples.length`): **3,532** (always ≤5 per row)
- Total full samples (sum of `instance_count`): **66,057** (full set available via `source_url`, not loaded today; ~19× larger than what UI currently shows)
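The percentages and the ~19× ratio follow directly from the counts above. The short check below recomputes them; the variable and function names are illustrative only.

```typescript
// Counts copied from the prevalence list above.
const modelFiles = 5830;
const filesWithIld = 55;
const resultRows = 86183;
const rowsWithIld = 712;
const inlineExamples = 3532;
const fullSamples = 66057;

// Formats part/whole as a percentage with two decimals.
const pct = (part: number, whole: number): string =>
  ((part / whole) * 100).toFixed(2) + "%";

const fileShare = pct(filesWithIld, modelFiles); // "0.94%"
const rowShare = pct(rowsWithIld, resultRows); // "0.83%"
const fullToInline = (fullSamples / inlineExamples).toFixed(1); // "18.7", i.e. ~19x
const avgInlinePerRow = (inlineExamples / rowsWithIld).toFixed(2); // "4.96", below the cap of 5

console.log({ fileShare, rowShare, fullToInline, avgInlinePerRow });
```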
**`instance_level_data`-level ("ild-level") shape uniformity (712/712 rows):**
- Top-level keys are always exactly `{interaction_type, instance_count, source_url, instance_examples}`
- `interaction_type` is always `"multi_turn"` (no single_turn samples in cache)
**Per-example branch firing rates (3,532 examples):**
- `input`: `input.raw` 100%
- `ground_truth`: `input.reference` 100%
- `response`: `answer_attribution` 97.31%, `messages` 2.49%, `output` 0.20%
- `is_correct`: `evaluation.is_correct` 100%
- `sample_id`: `sample_id` 100%
- `choices`: nothing (always undefined)
The 7-branch input chain, 5-branch ground_truth chain, 4-branch is_correct chain, 4-branch sample_id chain, 2-branch choices chain are **defensive scaffolding** for shapes the pipeline does not currently emit. The 7-branch response chain has 3 active sub-branches.
**URL JSONL shape verification:** sampled one source_url (`anthropic__anthropic-claude-3-7-sonnet/swe_bench_verified_mini_...`); first line had identical 18 keys to the inline `instance_examples[0]`, with `input.raw` and `evaluation.is_correct` in expected paths. Pipeline emits the same canonical shape on both inline and URL paths.
## Notes for pipeline implementer
- The pipeline already emits a canonical per-sample shape consistently. **No structural change needed to current emission.** The migration is to make this shape an explicit guarantee, not to change what's being emitted.
- Suggested guarantee: every `instance_examples[i]` (inline AND in JSONL at `source_url`) has at minimum `{sample_id, input.raw, input.reference?, evaluation.is_correct, answer_attribution? || messages?, metadata?}`.
- Once that guarantee is documented and verified, the TS parser shrinks dramatically: extract `raw.input.raw`, `raw.input.reference`, `raw.evaluation.is_correct`, `raw.sample_id` as direct field reads. The response field still needs the 3-branch fallback (answer_attribution → messages → output) until pipeline emits a single normalized `response` field.
- **Do NOT pre-ingest the URL JSONL into pipeline parquet** (per the architecture choice). The runtime UI fetches `source_url` on demand; this is the orphaned `fetchInstanceLevelData`'s intended use. The benefit of pre-ingestion (cross-corpus SQL queries over samples) is speculative; defer until a product feature demands it.
- The "shape uniformity" finding (712/712 rows have identical ild-level keys; all are `multi_turn`) suggests the pipeline already enforces the canonical shape. Worth documenting in the pipeline contract test (`tests/pipeline-contract.test.ts`).
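To make the "shrinks dramatically" point concrete, here is one possible shape for the post-migration parser. It assumes the suggested guarantee holds. Every name in it (`CanonicalExample`, `SlimSample`, `parseCanonical`, `pickResponse`) is hypothetical, and the response order follows the 3-branch fallback as written in these notes rather than the current parser's branch numbering.

```typescript
// Hypothetical type for the guaranteed per-sample shape.
interface CanonicalExample {
  sample_id: string;
  input: { raw: string; reference?: string | string[] };
  evaluation: { is_correct: boolean };
  answer_attribution?: Array<{ extracted_value?: string }>;
  messages?: Array<{ role: string; content: unknown }>;
  output?: unknown;
}

// Hypothetical, reduced output shape.
interface SlimSample {
  sample_id: string;
  input: string;
  ground_truth?: string;
  response: string;
  is_correct: boolean;
}

// 3-branch response fallback: answer_attribution, then messages, then output.
function pickResponse(ex: CanonicalExample): string {
  const attribution = ex.answer_attribution ?? [];
  const extracted = attribution[attribution.length - 1]?.extracted_value;
  if (extracted) return extracted;
  const lastAssistant = [...(ex.messages ?? [])]
    .reverse()
    .find((m) => m.role === "assistant");
  if (lastAssistant) {
    const content = lastAssistant.content;
    const text = typeof content === "string" ? content : JSON.stringify(content);
    if (text) return text;
  }
  if (ex.output != null) {
    return typeof ex.output === "string" ? ex.output : JSON.stringify(ex.output);
  }
  return "";
}

// Direct field reads replace the fallback chains.
function parseCanonical(ex: CanonicalExample): SlimSample {
  const ref = ex.input.reference;
  return {
    sample_id: ex.sample_id,
    input: ex.input.raw,
    ground_truth:
      ref === undefined ? undefined : Array.isArray(ref) ? ref.join(", ") : String(ref),
    response: pickResponse(ex),
    is_correct: ex.evaluation.is_correct,
  };
}
```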
## Migration checklist
- [x] Spec written
- [x] Tests cover each rule branch (`tests/transformations/instance-level-data.test.ts`)
- [x] Audit script (`scripts/verify-instance-level-data.mjs`)
- [ ] Filed with pipeline owner with the spec + tests + audit script as acceptance criterion
- [ ] Pipeline contract: explicit guarantee of canonical per-sample shape (Tier A test asserting `every instance_example has input.raw, sample_id, evaluation.is_correct`)
- [ ] TS deleted: `parseInstanceLevelData` shrinks to direct field reads (~10 lines instead of 110), or fully deleted if pipeline emits already-flat normalized records. `fetchInstanceLevelData` stays orphaned-but-preserved for the future "show all samples" UI feature, OR is wired up if that feature ships.
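The contract item above could be checked with a small validator along these lines. This is a sketch rather than the contents of `tests/pipeline-contract.test.ts`: `findContractViolations` is a hypothetical name, and it only checks the three fields named in the checklist.

```typescript
type Raw = Record<string, unknown>;

const isRecord = (v: unknown): v is Raw =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Hypothetical validator: returns one message per missing required field.
function findContractViolations(examples: unknown[]): string[] {
  const violations: string[] = [];
  examples.forEach((example, i) => {
    if (!isRecord(example)) {
      violations.push(`[${i}] not an object`);
      return;
    }
    const input = example.input;
    const evaluation = example.evaluation;
    if (example.sample_id == null) violations.push(`[${i}] missing sample_id`);
    if (!isRecord(input) || input.raw == null) violations.push(`[${i}] missing input.raw`);
    if (!isRecord(evaluation) || typeof evaluation.is_correct !== "boolean") {
      violations.push(`[${i}] missing evaluation.is_correct`);
    }
  });
  return violations;
}
```

A contract test would assert that the returned list is empty for every `instance_examples` array, both inline and in the JSONL at `source_url`.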
## Future product decisions (deferred)
- **"Show all samples" UI feature**: would un-orphan `fetchInstanceLevelData` and let users see the full 50-sample set instead of just the 5-sample preview. Lazy fetch on user click. This is the concrete capability the URL-fetch architecture supports; the spec assumes it's a product roadmap item, not committed scope.
- **Cross-model sample querying / search / filter**: would require `instance_samples.parquet` and SQL queries (the alternative architecture I initially proposed). Out of scope for this spec; revisit if/when product asks.
- **Single-turn samples**: pipeline currently only emits `interaction_type: multi_turn`. If pipeline starts emitting single_turn shapes that exercise dormant parser branches (e.g. flat `input` strings, `prompt`/`question` fields), the spec's "100% canonical shape" claim breaks and the parser fallbacks become live again.