| # Per-instance JSONL normalization | |
| Drafted 2026-04-28. Migration item #7 in `notes/migration-plan.md`. | |
| ## Framing reminder | |
| We are refactoring for UI efficiency. TS-as-is is the canonical spec for behaviour. This spec covers a per-record value transform that extracts UI-friendly strings (input/response/correctness/etc.) from rich per-sample objects emitted by the pipeline. An audit on 2026-04-28 against the full corpus found that the parser's many fallback branches mostly do not fire, because the pipeline emits a single canonical shape; the parser preserves them as defensive scaffolding for older or different harness output formats. | |
| **Architecture choice (cleaning-only, no SQL involvement):** the inline `instance_examples` preview (≤5 samples per result) gets normalized in pipeline; the per-result `source_url` pointing to a full JSONL dump (typically 50 samples) stays as an on-demand UI fetch, NOT pre-ingested into a parquet table. This avoids speculative pipeline ingestion work and matches the actual product need (lazy-load all samples on user click, not cross-corpus sample querying). The orphaned `fetchInstanceLevelData` (`lib/hf-data.ts:890-917`) gets re-wired to a future "show all samples" UI feature rather than deleted. | |
| ## Rule (as TS implements it today) | |
| ### `parseInstanceLevelData` (`lib/hf-data.ts:933-1043`) | |
| Pure function: takes a JSON object, walks `instance_examples[]`, returns `SampleResult[]`. | |
| Top-level guard: | |
| - If `data` is null/non-object → return `[]` | |
| - If `data.instance_examples` is array → use it | |
| - Else if `data` itself is array → use it | |
| - Else → `[]` | |
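The guard above can be sketched as a standalone helper. `selectExamples` is a hypothetical name chosen for illustration; the real parser in `lib/hf-data.ts` may structure this differently.

```typescript
// Hypothetical helper mirroring the top-level guard, in the order listed above.
function selectExamples(data: unknown): unknown[] {
  // null / non-object input yields an empty list
  if (data === null || typeof data !== "object") return [];
  // prefer the wrapped form: { instance_examples: [...] }
  const examples = (data as { instance_examples?: unknown }).instance_examples;
  if (Array.isArray(examples)) return examples;
  // otherwise accept a bare array of examples
  if (Array.isArray(data)) return data;
  return [];
}
```

Because arrays are objects in JavaScript, a bare array passes the first check and is picked up by the third.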
| Per-example field extraction (each is a fallback chain): | |
| **`input` (string)**, first non-empty wins: | |
| 1. `raw.input` is string → use as-is | |
| 2. `raw.input.raw` is set → `String(raw.input.raw)` | |
| 3. `raw.prompt` → use as-is | |
| 4. `raw.question` → use as-is | |
| 5. `raw.doc.question` → use as-is | |
| 6. `raw.doc` exists → `JSON.stringify(raw.doc).slice(0, 500)` | |
| 7. (none) → empty string | |
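A minimal sketch of the `input` chain, under two assumptions that this spec does not confirm: an empty string counts as "missing" and falls through, and non-string `prompt`/`question` values are ignored. `extractInput`, `Raw`, and `isObject` are hypothetical names.

```typescript
type Raw = Record<string, unknown>;

const isObject = (v: unknown): v is Raw => v !== null && typeof v === "object";

// Hypothetical sketch of the `input` fallback chain (branches 1-7 above).
function extractInput(raw: Raw): string {
  const input = raw.input;
  const doc = raw.doc;
  const candidates: unknown[] = [
    typeof input === "string" ? input : undefined,                       // 1
    isObject(input) && input.raw != null ? String(input.raw) : undefined, // 2
    raw.prompt,                                                          // 3
    raw.question,                                                        // 4
    isObject(doc) ? doc.question : undefined,                            // 5
  ];
  for (const candidate of candidates) {
    // Assumption: empty strings are treated as missing.
    if (typeof candidate === "string" && candidate !== "") return candidate;
  }
  // 6: last resort, a truncated JSON dump of the doc; 7: empty string
  return doc != null ? JSON.stringify(doc).slice(0, 500) : "";
}
```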
| **`ground_truth` (string | undefined)**, first non-null wins: | |
| 1. `raw.input.reference` is array → `array.join(", ")`; else → `String(...)` | |
| 2. `raw.ground_truth` → `String(...)` | |
| 3. `raw.target` → `String(...)` | |
| 4. `raw.gold` → `String(...)` | |
| 5. `raw.doc.answer` → `String(...)` | |
| 6. (none) → `undefined` | |
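The `ground_truth` chain differs from `input` in that it tests for non-null rather than non-empty, so a falsy value such as `0` still wins. A hedged sketch (`extractGroundTruth` and `asObject` are hypothetical names):

```typescript
type Raw = Record<string, unknown>;

const asObject = (v: unknown): Raw | undefined =>
  v !== null && typeof v === "object" ? (v as Raw) : undefined;

// Hypothetical sketch of the `ground_truth` fallback chain.
function extractGroundTruth(raw: Raw): string | undefined {
  const reference = asObject(raw.input)?.reference;
  if (reference != null) {
    // Branch 1: arrays are joined, everything else is coerced with String()
    return Array.isArray(reference) ? reference.join(", ") : String(reference);
  }
  // Branches 2-5, in priority order
  const fallbacks: unknown[] = [raw.ground_truth, raw.target, raw.gold, asObject(raw.doc)?.answer];
  const hit = fallbacks.find((value) => value != null);
  return hit == null ? undefined : String(hit);
}
```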
| **`response` (string)**, first non-empty wins: | |
| 1. `raw.output` is set → string-as-is or `JSON.stringify(...)` | |
| 2. `raw.response` → use as-is | |
| 3. `raw.model_output` → use as-is | |
| 4. `raw.answer_attribution` is non-empty array → take last element's `extracted_value` (or empty string) | |
| 5. `raw.messages` is non-empty array → reverse + find the first assistant message (i.e. the last one in the conversation) → its content as-is if a string, else stringified | |
| 6. `raw.filtered_resps[0][0]` → use | |
| 7. `raw.resps[0][0]` → use | |
| 8. (none) → empty string | |
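The `response` chain is the only one with several live branches, so its precedence matters. The sketch below is an illustration, not the code in `lib/hf-data.ts`; it assumes that an empty result from one branch falls through to the next, and all function names are hypothetical.

```typescript
type Raw = Record<string, unknown>;

const isObject = (v: unknown): v is Raw => v !== null && typeof v === "object";

// Branch 4: `extracted_value` of the last attribution entry.
function lastExtractedValue(attribution: unknown): unknown {
  if (!Array.isArray(attribution) || attribution.length === 0) return undefined;
  const last: unknown = attribution[attribution.length - 1];
  return isObject(last) ? last.extracted_value : undefined;
}

// Branch 5: content of the last message whose role is "assistant".
function lastAssistantContent(messages: unknown): unknown {
  if (!Array.isArray(messages)) return undefined;
  for (let i = messages.length - 1; i >= 0; i--) {
    const message: unknown = messages[i];
    if (isObject(message) && message.role === "assistant") return message.content;
  }
  return undefined;
}

// Branches 6-7: first element of the first inner array.
function firstOfFirst(resps: unknown): unknown {
  return Array.isArray(resps) && Array.isArray(resps[0]) ? resps[0][0] : undefined;
}

function extractResponse(raw: Raw): string {
  const candidates: unknown[] = [
    raw.output,
    raw.response,
    raw.model_output,
    lastExtractedValue(raw.answer_attribution),
    lastAssistantContent(raw.messages),
    firstOfFirst(raw.filtered_resps),
    firstOfFirst(raw.resps),
  ];
  for (const candidate of candidates) {
    if (candidate == null) continue;
    const text = typeof candidate === "string" ? candidate : JSON.stringify(candidate);
    if (text !== "") return text; // assumption: empty results fall through
  }
  return "";
}
```

Note that `raw.output` outranks `raw.answer_attribution`, so a sample carrying both resolves to `output`.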
| **`is_correct` (boolean | undefined)**, first defined wins: | |
| 1. `raw.evaluation.is_correct` (boolean) | |
| 2. `raw.is_correct` (boolean) | |
| 3. `raw.metrics.exact_match === 1` → true; `=== 0` → false; else undefined | |
| 4. (none) → `undefined` | |
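A short sketch of the `is_correct` chain. The detail worth seeing in code is that `exact_match` only maps to a boolean for exactly `1` or `0`; partial scores stay `undefined`. `extractIsCorrect` and `asObject` are hypothetical names.

```typescript
type Raw = Record<string, unknown>;

const asObject = (v: unknown): Raw | undefined =>
  v !== null && typeof v === "object" ? (v as Raw) : undefined;

// Hypothetical sketch of the `is_correct` fallback chain.
function extractIsCorrect(raw: Raw): boolean | undefined {
  const nested = asObject(raw.evaluation)?.is_correct;
  if (typeof nested === "boolean") return nested; // branch 1
  const flat = raw.is_correct;
  if (typeof flat === "boolean") return flat;     // branch 2
  const exactMatch = asObject(raw.metrics)?.exact_match;
  if (exactMatch === 1) return true;              // branch 3
  if (exactMatch === 0) return false;
  return undefined;                               // branch 4
}
```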
| **`metadata` (object | undefined)**, merged from (in order): | |
| - `raw.evaluation` (if object) | |
| - `raw.performance` (if object) | |
| - `raw.metadata` (if object) | |
| - `raw.metrics` (if object) | |
| - If merged object is empty → `undefined` | |
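The merge can be sketched as below. Two points are assumptions rather than facts from this spec: that a later source overwrites an earlier one on key collisions (spread / `Object.assign` semantics), and that arrays are not treated as mergeable objects. `mergeMetadata` is a hypothetical name.

```typescript
type Raw = Record<string, unknown>;

// Hypothetical sketch of the `metadata` merge, in the order listed above.
function mergeMetadata(raw: Raw): Raw | undefined {
  const merged: Raw = {};
  for (const key of ["evaluation", "performance", "metadata", "metrics"]) {
    const part = raw[key];
    // Assumption: only plain (non-array) objects are merged.
    if (part !== null && typeof part === "object" && !Array.isArray(part)) {
      Object.assign(merged, part); // assumption: later sources win on collisions
    }
  }
  return Object.keys(merged).length > 0 ? merged : undefined;
}
```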
| **`sample_id` (string)**, first non-null wins: | |
| 1. `raw.sample_id` → as-is | |
| 2. `raw.doc_id` → as-is | |
| 3. `raw.id` → as-is | |
| 4. (none) → `String(arrayIndex)` (positional fallback) | |
| **`choices` (any | undefined)**, first non-null wins: | |
| 1. `raw.choices` | |
| 2. `raw.doc.choices` | |
| 3. (none) → `undefined` | |
| If a row's first-pass map returns `null` (i.e. `raw` was null or non-object), it's filtered out via `.filter(s => s !== null)`. | |
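The interaction between the positional `sample_id` fallback and the null filter is easy to get wrong: in the sketch below the index is taken during the map, before filtering, so dropped rows still consume an index. This ordering is an assumption based on the description above. `toRow`, `toRows`, and `Row` are hypothetical names, and the sketch coerces ids with `String(...)` where the real parser passes them through as-is.

```typescript
type Raw = Record<string, unknown>;

interface Row {
  sample_id: string;
  choices?: unknown;
}

// Hypothetical per-row map covering only `sample_id` and `choices`.
function toRow(raw: unknown, index: number): Row | null {
  if (raw === null || typeof raw !== "object") return null;
  const r = raw as Raw;
  const id = r.sample_id ?? r.doc_id ?? r.id;
  const doc = r.doc;
  const docChoices =
    doc !== null && typeof doc === "object" ? (doc as Raw).choices : undefined;
  return {
    sample_id: id == null ? String(index) : String(id), // positional fallback
    choices: r.choices ?? docChoices ?? undefined,
  };
}

function toRows(examples: unknown[]): Row[] {
  // Map first (so indices refer to the original array), then drop nulls.
  return examples.map(toRow).filter((row): row is Row => row !== null);
}
```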
| ### `fetchInstanceLevelData` (`lib/hf-data.ts:890-917`), currently orphaned | |
| Takes a `(url, limit?)`. Fetches the URL via `fetch()`. Splits text on newlines (filters empty lines). For each line up to `limit` (or all if no limit), tries `JSON.parse`; skips malformed lines. Wraps the parsed array as `{ instance_examples: parsed }` and passes to `parseInstanceLevelData`. Returns `SampleResult[]` or `[]` on any error. | |
| **Active call sites: zero.** Verified by grep across `app/`, `components/`, `scripts/`, other `lib/` files. Only mention is its own declaration. Git log shows one commit ("Refresh eval cards UI and backend data flow") in its history. | |
| The function is intended for a "load more / show all samples" UI feature: the pipeline ships a `source_url` per row pointing to the full JSONL (typically 50 samples), but the inline `instance_examples` preview carries at most 5. The fetcher would let the UI request the full set on user demand. **This UI feature has never shipped.** | |
| ## Classification | |
| - **Unconditional normalization.** The function always runs on whatever shape is provided; never gates on a pre-existing canonical field. Pipeline-side fix: emit a canonical per-sample shape so the multi-field fallback chains become unnecessary. | |
| - **Cleaning → pipeline.** Pure value transform per sample. No aggregation, no reshape, no cross-record operations. Migration target: pipeline emits canonical per-sample shape on both the inline preview AND the URL JSONL files; TS parser shrinks to direct field reads or deletes entirely. | |
| - **NOT reshape.** Per the architecture choice in the framing note above, samples stay as on-demand fetches via `source_url`; no parquet `instance_samples` table, no SQL queries over samples. (If a future product feature wants cross-model sample search/filter/comparison, that's a separate reshape spec.) | |
| ## Inputs and expected outputs | |
| ### Group A: Pipeline-canonical shape (the only shape that fires in production today) | |
| Input shape (from cache `result.instance_level_data.instance_examples[i]`): | |
| ```jsonc | |
| { | |
| "schema_version": "...", | |
| "evaluation_id": "...", | |
| "model_id": "...", | |
| "evaluation_name": "...", | |
| "sample_id": "...", | |
| "sample_hash": "...", // sometimes present | |
| "interaction_type": "multi_turn", | |
| "input": { "raw": "..." }, // ALWAYS object with .raw in production | |
| "output": "..." | { ... }, // sometimes | |
| "messages": [{ role: "...", content: "..." }, ...], // typically present | |
| "answer_attribution": [..., { "extracted_value": "..." }], // typically present | |
| "evaluation": { "is_correct": true|false, ... }, // ALWAYS present | |
| "performance": { ... }, | |
| "metadata": { ... }, | |
| "token_usage": { ... }, | |
| "error": null | "...", | |
| "hierarchy": [...] | |
| } | |
| ``` | |
| Expected output (`SampleResult`): | |
| | Output field | Source path that fires | Notes | | |
| |---|---|---| | |
| | `sample_id` | `raw.sample_id` (100% of production) | always present | | |
| | `input` | `raw.input.raw` (100%) | branch #2 in the chain | | |
| | `ground_truth` | `raw.input.reference` (100%) | branch #1 | | |
| | `response` | `raw.answer_attribution` (97.31%) OR `raw.messages` (2.49%) OR `raw.output` (0.20%) | branches #4, #5, #1 | | |
| | `is_correct` | `raw.evaluation.is_correct` (100%) | branch #1 | | |
| | `choices` | `undefined` (100%) | no branch fires; field is unset in production | | |
| | `metadata` | merged from `raw.evaluation`, `raw.performance`, `raw.metadata`, `raw.metrics` (always at least 2 of 4 present) | merged object | | |
| ### Group B: Defensive fallback branches (zero firing rate in current production) | |
| | Branch | Output field | Production hits | Origin (presumed) | | |
| |---|---|---|---| | |
| | `raw.input` (string) | input | 0 | older harness shapes | | |
| | `raw.prompt` | input | 0 | lm-eval-harness | | |
| | `raw.question` | input | 0 | other harnesses | | |
| | `raw.doc.question` | input | 0 | HELM-style | | |
| | `raw.doc` (JSON.stringify) | input | 0 | last-resort | | |
| | `raw.ground_truth` | ground_truth | 0 | older shapes | | |
| | `raw.target` | ground_truth | 0 | classification benchmarks | | |
| | `raw.gold` | ground_truth | 0 | older lm-eval | | |
| | `raw.doc.answer` | ground_truth | 0 | HELM-style | | |
| | `raw.response` | response | 0 | older shapes | | |
| | `raw.model_output` | response | 0 | older shapes | | |
| | `raw.filtered_resps[0][0]` | response | 0 | lm-eval-harness format | | |
| | `raw.resps[0][0]` | response | 0 | lm-eval-harness format | | |
| | `raw.is_correct` | is_correct | 0 | flat shape | | |
| | `raw.metrics.exact_match` | is_correct | 0 | metric-based correctness | | |
| | `raw.doc_id` | sample_id | 0 | HELM-style | | |
| | `raw.id` | sample_id | 0 | generic | | |
| | index fallback | sample_id | 0 | last-resort | | |
| | `raw.choices` | choices | 0 | multiple-choice | | |
| | `raw.doc.choices` | choices | 0 | HELM multiple-choice | | |
| These branches exist for shapes the pipeline currently does not emit. **Preserve verbatim** until pipeline-side guarantees the canonical shape across all data sources. | |
| ### Group C: `fetchInstanceLevelData` JSONL parsing edge cases | |
| | Input | Behavior | | |
| |---|---| | |
| | URL returns `!res.ok` (404, 500, etc.) | returns `[]` (no throw) | | |
| | URL throws (network error) | logs warning to console, returns `[]` | | |
| | Empty body | splits to `[]`, returns `[]` | | |
| | Body with empty lines | `.filter(line => line.trim())` strips them | | |
| | Body with malformed JSON line | swallowed in inner try-catch; line skipped, processing continues | | |
| | `limit=0` or `limit=undefined` | parses ALL lines | | |
| | `limit > lines.length` | parses all lines (capped via `Math.min`) | | |
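The line-handling rows of the table above (everything except the network cases) can be sketched as a pure function. `parseJsonl` is a hypothetical name; the real `fetchInstanceLevelData` also performs the `fetch()` and wraps the result as `{ instance_examples: parsed }` before calling the parser.

```typescript
// Hypothetical sketch of the JSONL line handling described in the table.
function parseJsonl(text: string, limit?: number): unknown[] {
  // Blank and whitespace-only lines are stripped before counting.
  const lines = text.split("\n").filter((line) => line.trim());
  // A falsy limit (0 or undefined) means "all lines"; otherwise cap at the line count.
  const count = limit ? Math.min(limit, lines.length) : lines.length;
  const parsed: unknown[] = [];
  for (const line of lines.slice(0, count)) {
    try {
      parsed.push(JSON.parse(line));
    } catch {
      // Malformed line: skip it and keep processing.
    }
  }
  return parsed;
}
```

In this sketch the limit applies to lines read, not to lines successfully parsed, so a malformed line inside the limit reduces the number of results.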
| ## Current TS implementation | |
| | Concern | Location | Notes | | |
| |---|---|---| | |
| | `parseInstanceLevelData` | `lib/hf-data.ts:933-1043` | The parser; ~110 lines | | |
| | `fetchInstanceLevelData` | `lib/hf-data.ts:890-917` | URL fetcher; orphaned (zero callers) | | |
| | Active call site | `lib/hf-data.ts:1273` (inside `flattenHierarchyNode`) | `parseInstanceLevelData(result.instance_level_data)` → `inlineSamples` | |
| | Internal call site | `lib/hf-data.ts:912` | inside `fetchInstanceLevelData`, which hands the parsed JSONL lines to the parser | |
| | `SampleResult` type | `lib/benchmark-schema.ts:135` | Output shape definition | | |
| | Output field | `BenchmarkEvaluation.detailed_evaluation_results_per_samples` | `lib/benchmark-schema.ts:33` | | |
| ### Caller chain for `parseInstanceLevelData` | |
| `getModelSummaryById` (lib/model-data.ts:1490+) → `flattenModelEvaluations` → `flattenHierarchyNode` (lib/hf-data.ts:1273) → `parseInstanceLevelData(result.instance_level_data)` → set as `inlineSamples` on each variant bucket → propagated to `BenchmarkEvaluation.detailed_evaluation_results_per_samples`. | |
| UI consumers (read `data.detailed_evaluation_results_per_samples`): | |
| - `components/benchmark-detail.tsx:3869`: random sample for preview block | |
| - `components/benchmark-detail.tsx:4174-4216`: sample preview UI in benchmark detail | |
| - `components/benchmark-detail.tsx:4982`: variant-level sample availability check | |
| - `components/benchmark-detail.tsx:5284-5286`: variant sample picker | |
| - `components/benchmark-detail.tsx:5569-5608`: sample preview list with INSTANCE_PREVIEW_LIMIT and "see all" expansion | |
| ### Caller chain for `fetchInstanceLevelData` | |
| None. Function is exported and unreached. Preserved with the intent that a future "show all samples" UI consumer wires up to it. | |
| ## Pipeline status | |
| ### Side-by-side comparison | |
| | Aspect | TS (this spec) | Pipeline today | Result for users | | |
| |---|---|---|---| | |
| | Inline preview shape | parser handles many variants | emits ONE canonical shape (`input.raw`, `evaluation.is_correct`, etc.) | parser's fallback branches almost all dead | | |
| | URL JSONL shape | same parser handles | emits IDENTICAL canonical shape (verified by sampling one URL on 2026-04-28) | parser would work the same on URL data | | |
| | Inline preview size | parser doesn't care | at most 5 samples per `instance_examples` array (3,532 examples across 712 rows) | UI capped at 5 today | |
| | Total samples per row | n/a | typically 50 (per `instance_count` field), one outlier 18 | only ~10% of a typical row (5 of 50) accessible to UI today | |
| | URL-fetch use case | `fetchInstanceLevelData` exists | `source_url` always emitted | dead code on TS side; no UI consumer | | |
| ### Concrete worked example with quantified scope | |
| Audited 2026-04-28 against `.cache/hf-data/`. Verified by `scripts/verify-instance-level-data.mjs`. | |
| **Prevalence:** | |
| - Total model files: 5,830 | |
| - Files with any `instance_level_data`: **55 (0.94%)** | |
| - Total `(metric × model_result)` rows: 86,183 | |
| - Result rows with `instance_level_data`: **712 (0.83%)** | |
| - Total inline preview examples (sum of `instance_examples.length`): **3,532** (always ≤5 per row) | |
| - Total full samples (sum of `instance_count`): **66,057** (full set available via `source_url`, not loaded today; ~19× larger than what UI currently shows) | |
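The percentages and the ~19× figure follow directly from the counts above; the snippet below only redoes that arithmetic so the numbers can be re-checked when the audit is rerun. The object and variable names are illustrative.

```typescript
// Counts copied from the prevalence list above.
const audit = {
  modelFiles: 5830,
  filesWithIld: 55,
  resultRows: 86183,
  rowsWithIld: 712,
  inlineExamples: 3532,
  fullSamples: 66057,
};

const percent = (part: number, whole: number): string =>
  `${((100 * part) / whole).toFixed(2)}%`;

const fileShare = percent(audit.filesWithIld, audit.modelFiles);           // "0.94%"
const rowShare = percent(audit.rowsWithIld, audit.resultRows);             // "0.83%"
const expansion = (audit.fullSamples / audit.inlineExamples).toFixed(1);   // "18.7"
const meanPreview = (audit.inlineExamples / audit.rowsWithIld).toFixed(2); // "4.96"
const meanFull = (audit.fullSamples / audit.rowsWithIld).toFixed(1);       // "92.8"

console.log({ fileShare, rowShare, expansion, meanPreview, meanFull });
```

The mean preview size of 4.96 is consistent with "≤5 per row" (a few rows carry fewer than 5), and the mean full-sample count of about 93 means at least some rows must carry more than the typical 50.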
| **`instance_level_data` (ild) shape uniformity (712/712 rows):** | |
| - Top-level keys are always exactly `{interaction_type, instance_count, source_url, instance_examples}` | |
| - `interaction_type` is always `"multi_turn"` (no single_turn samples in cache) | |
| **Per-example branch firing rates (3,532 examples):** | |
| - `input`: `input.raw` 100% | |
| - `ground_truth`: `input.reference` 100% | |
| - `response`: `answer_attribution` 97.31%, `messages` 2.49%, `output` 0.20% | |
| - `is_correct`: `evaluation.is_correct` 100% | |
| - `sample_id`: `sample_id` 100% | |
| - `choices`: nothing (always undefined) | |
| The `input`, `ground_truth`, `is_correct`, and `sample_id` chains each fire on exactly one branch in production (`input.raw`, `input.reference`, `evaluation.is_correct`, `sample_id`), and the `choices` chain never fires; every other branch in those chains is **defensive scaffolding** for shapes the pipeline does not currently emit. The 7-branch `response` chain has 3 active branches. | |
| **URL JSONL shape verification:** sampled one source_url (`anthropic__anthropic-claude-3-7-sonnet/swe_bench_verified_mini_...`); first line had identical 18 keys to the inline `instance_examples[0]`, with `input.raw` and `evaluation.is_correct` in expected paths. Pipeline emits the same canonical shape on both inline and URL paths. | |
| ## Notes for pipeline implementer | |
| - The pipeline already emits a canonical per-sample shape consistently. **No structural change needed to current emission.** The migration is to make this shape an explicit guarantee, not to change what's being emitted. | |
| - Suggested guarantee: every `instance_examples[i]` (inline AND in JSONL at `source_url`) has at minimum `{sample_id, input.raw, input.reference?, evaluation.is_correct, answer_attribution? || messages?, metadata?}`. | |
| - Once that guarantee is documented and verified, the TS parser shrinks dramatically: extract `raw.input.raw`, `raw.input.reference`, `raw.evaluation.is_correct`, `raw.sample_id` as direct field reads. The response field still needs the 3-branch fallback (`output` → `answer_attribution` → `messages`, in the parser's precedence order) until the pipeline emits a single normalized `response` field. | |
| - **Do NOT pre-ingest the URL JSONL into pipeline parquet** (per the architecture choice). The runtime UI fetches `source_url` on demand; this is the orphaned `fetchInstanceLevelData`'s intended use. The benefit of pre-ingestion (cross-corpus SQL queries over samples) is speculative; defer until a product feature demands it. | |
| - The "shape uniformity" finding (712/712 rows have identical ild-level keys; all are `multi_turn`) suggests the pipeline already enforces the canonical shape. Worth documenting in the pipeline contract test (`tests/pipeline-contract.test.ts`). | |
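To make the "parser shrinks to direct field reads" point concrete, here is one possible shape for the slimmed-down parser once the guarantee holds. Everything in it is an assumption for illustration: `CanonicalExample`, `SlimSample`, `pickResponse`, and `parseCanonical` are hypothetical names, the field types are guesses from the Group A shape, and the real `SampleResult` type in `lib/benchmark-schema.ts` may differ.

```typescript
// Assumed canonical shape, limited to the fields the slim parser reads.
interface CanonicalExample {
  sample_id: string;
  input: { raw: string; reference?: string | string[] };
  evaluation: { is_correct: boolean };
  output?: unknown;
  answer_attribution?: Array<{ extracted_value?: string }>;
  messages?: Array<{ role: string; content: unknown }>;
}

// Hypothetical output type; not the real SampleResult.
interface SlimSample {
  sample_id: string;
  input: string;
  ground_truth: string | undefined;
  response: string;
  is_correct: boolean;
}

const toText = (v: unknown): string => (typeof v === "string" ? v : JSON.stringify(v));

// The one remaining fallback: output, then answer_attribution, then messages.
function pickResponse(ex: CanonicalExample): string {
  if (ex.output != null) return toText(ex.output);
  const attribution = ex.answer_attribution ?? [];
  if (attribution.length > 0) return attribution[attribution.length - 1]?.extracted_value ?? "";
  const assistant = [...(ex.messages ?? [])].reverse().find((m) => m.role === "assistant");
  if (assistant === undefined || assistant.content == null) return "";
  return toText(assistant.content);
}

function parseCanonical(ex: CanonicalExample): SlimSample {
  const ref = ex.input.reference;
  return {
    sample_id: ex.sample_id,
    input: ex.input.raw,
    ground_truth: ref === undefined ? undefined : Array.isArray(ref) ? ref.join(", ") : ref,
    response: pickResponse(ex),
    is_correct: ex.evaluation.is_correct,
  };
}
```

If the pipeline later emits a single normalized `response` field, `pickResponse` collapses to one more direct read.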
| ## Migration checklist | |
| - [x] Spec written | |
| - [x] Tests cover each rule branch (`tests/transformations/instance-level-data.test.ts`) | |
| - [x] Audit script (`scripts/verify-instance-level-data.mjs`) | |
| - [ ] Filed with pipeline owner with the spec + tests + audit script as acceptance criterion | |
| - [ ] Pipeline contract: explicit guarantee of canonical per-sample shape (Tier A test asserting `every instance_example has input.raw, sample_id, evaluation.is_correct`) | |
| - [ ] TS deleted: `parseInstanceLevelData` shrinks to direct field reads (~10 lines instead of 110), or fully deleted if pipeline emits already-flat normalized records. `fetchInstanceLevelData` stays orphaned-but-preserved for the future "show all samples" UI feature, OR is wired up if that feature ships. | |
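The Tier A assertion in the checklist could be built on a small checker like the one below. It is a sketch under assumptions: `contractViolations` is a hypothetical name, presence is tested as non-null for `sample_id` and `input.raw`, and `evaluation.is_correct` is required to be a boolean. How it plugs into `tests/pipeline-contract.test.ts` is left open.

```typescript
type Raw = Record<string, unknown>;

const isObject = (v: unknown): v is Raw => v !== null && typeof v === "object";

// Returns the list of required paths that are missing or mistyped.
function contractViolations(example: unknown): string[] {
  if (!isObject(example)) return ["example is not an object"];
  const problems: string[] = [];
  if (example.sample_id == null) problems.push("sample_id");
  const input = example.input;
  if (!isObject(input) || input.raw == null) problems.push("input.raw");
  const evaluation = example.evaluation;
  if (!isObject(evaluation) || typeof evaluation.is_correct !== "boolean") {
    problems.push("evaluation.is_correct");
  }
  return problems;
}
```

A contract test could then assert that `contractViolations(example)` is empty for every inline example and for every line of a sampled `source_url` dump.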
| ## Future product decisions (deferred) | |
| - **"Show all samples" UI feature:** would un-orphan `fetchInstanceLevelData` and let users see the full 50-sample set instead of just the 5-sample preview. Lazy fetch on user click. This is the concrete capability the URL-fetch architecture supports; the spec assumes it's a product roadmap item, not committed scope. | |
| - **Cross-model sample querying / search / filter:** would require `instance_samples.parquet` and SQL queries (the alternative architecture I initially proposed). Out of scope for this spec; revisit if/when product asks. | |
| - **Single-turn samples:** the pipeline currently only emits `interaction_type: multi_turn`. If the pipeline starts emitting single_turn shapes that exercise dormant parser branches (e.g. flat `input` strings, `prompt`/`question` fields), the spec's "100% canonical shape" claim breaks and the parser fallbacks become live again. | |