general-eval-card / notes /transformations /12-instance-level-data.md
Jenny Chim

Per-instance JSONL normalization

Drafted 2026-04-28. Migration item #7 in notes/migration-plan.md.

Framing reminder

We are refactoring for UI efficiency. TS-as-is is the canonical spec for behaviour. This spec covers a per-record value transform that extracts UI-friendly strings (input/response/correctness/etc.) from rich per-sample objects emitted by the pipeline. An audit against the full corpus on 2026-04-28 found that most of the parser's fallback branches never fire, because the pipeline emits a single canonical shape; the parser keeps them as defensive scaffolding for older or different harness output formats.

Architecture choice (cleaning-only, no SQL involvement): the inline instance_examples preview (≤5 samples per result) gets normalized in the pipeline; the per-result source_url pointing to a full JSONL dump (typically 50 samples) stays as an on-demand UI fetch, NOT pre-ingested into a parquet table. This avoids speculative pipeline ingestion work and matches the actual product need (lazy-load all samples on user click, not cross-corpus sample querying). The orphaned fetchInstanceLevelData (lib/hf-data.ts:890-917) gets re-wired to a future "show all samples" UI feature rather than deleted.

Rule (as TS implements it today)

parseInstanceLevelData (lib/hf-data.ts:933-1043)

Pure function: takes a JSON object, walks instance_examples[], returns SampleResult[].

Top-level guard:

  • If data is null/non-object → return []
  • If data.instance_examples is array → use it
  • Else if data itself is array → use it
  • Else → []
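The guard above can be sketched as follows. This is a minimal illustration rather than the code in lib/hf-data.ts; `selectExamples` is a hypothetical name introduced here.

```typescript
// Hypothetical helper mirroring the top-level guard; not the actual code in lib/hf-data.ts.
function selectExamples(data: unknown): unknown[] {
  // null, undefined, and primitives yield no examples
  if (typeof data !== "object" || data === null) return [];
  // Prefer an instance_examples array when one is present
  const examples = (data as Record<string, unknown>)["instance_examples"];
  if (Array.isArray(examples)) return examples;
  // Otherwise accept a bare array of examples, else nothing
  return Array.isArray(data) ? data : [];
}
```

Returning an empty array on every unexpected shape keeps callers free of null checks, which matches the "return []" behaviour described for each failing branch.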

Per-example field extraction (each is a fallback chain):

input (string), first non-empty wins:

  1. raw.input is string → use as-is
  2. raw.input.raw is set → String(raw.input.raw)
  3. raw.prompt → use as-is
  4. raw.question → use as-is
  5. raw.doc.question → use as-is
  6. raw.doc exists → JSON.stringify(raw.doc).slice(0, 500)
  7. (none) → empty string
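A minimal sketch of this chain, under stated assumptions: `field` and `extractInput` are hypothetical names, empty strings are treated as "not set", and prompt/question values are assumed to be strings. The real parser's truthiness checks may differ.

```typescript
// Safe property read: returns undefined when obj is not an object.
function field(obj: unknown, key: string): unknown {
  return typeof obj === "object" && obj !== null
    ? (obj as Record<string, unknown>)[key]
    : undefined;
}

// Hypothetical sketch of the input fallback chain (branch numbers match the list above).
function extractInput(raw: unknown): string {
  const input = field(raw, "input");
  if (typeof input === "string" && input !== "") return input; // 1
  const inner = field(input, "raw");
  if (inner !== undefined && inner !== null && inner !== "") return String(inner); // 2
  const doc = field(raw, "doc");
  const candidates = [field(raw, "prompt"), field(raw, "question"), field(doc, "question")];
  for (const candidate of candidates) {
    if (typeof candidate === "string" && candidate !== "") return candidate; // 3-5
  }
  // 6: last resort, a truncated dump of the whole doc object
  if (typeof doc === "object" && doc !== null) return JSON.stringify(doc).slice(0, 500);
  return ""; // 7
}
```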

ground_truth (string | undefined), first non-null wins:

  1. raw.input.reference is array → array.join(", "); else → String(...)
  2. raw.ground_truth → String(...)
  3. raw.target → String(...)
  4. raw.gold → String(...)
  5. raw.doc.answer → String(...)
  6. (none) → undefined
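The same chain as a short sketch. `field` and `extractGroundTruth` are hypothetical names, and "non-null" is read here as "neither null nor undefined".

```typescript
// Safe property read: returns undefined when obj is not an object.
function field(obj: unknown, key: string): unknown {
  return typeof obj === "object" && obj !== null
    ? (obj as Record<string, unknown>)[key]
    : undefined;
}

// Hypothetical sketch of the ground_truth fallback chain.
function extractGroundTruth(raw: unknown): string | undefined {
  const reference = field(field(raw, "input"), "reference");
  if (reference !== undefined && reference !== null) {
    // 1: arrays are joined, everything else is coerced to a string
    return Array.isArray(reference) ? reference.join(", ") : String(reference);
  }
  for (const key of ["ground_truth", "target", "gold"]) {
    const value = field(raw, key); // 2-4
    if (value !== undefined && value !== null) return String(value);
  }
  const answer = field(field(raw, "doc"), "answer"); // 5
  if (answer !== undefined && answer !== null) return String(answer);
  return undefined; // 6
}
```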

response (string), first non-empty wins:

  1. raw.output is set → string-as-is or JSON.stringify(...)
  2. raw.response → use as-is
  3. raw.model_output → use as-is
  4. raw.answer_attribution is non-empty array → take last element's extracted_value (or empty string)
  5. raw.messages is non-empty array → reverse + find last assistant message → string content or stringified
  6. raw.filtered_resps[0][0] → use
  7. raw.resps[0][0] → use
  8. (none) → empty string
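A sketch of the response chain follows. `field`, `isSet`, `asText`, and `extractResponse` are hypothetical names. The sketch assumes that an empty result at one branch falls through to the next, which is one reading of "first non-empty wins"; the real parser may stop at branch 4 with an empty string.

```typescript
function field(obj: unknown, key: string): unknown {
  return typeof obj === "object" && obj !== null
    ? (obj as Record<string, unknown>)[key]
    : undefined;
}

const isSet = (v: unknown): boolean => v !== undefined && v !== null && v !== "";

// Strings pass through; anything else is JSON-stringified.
const asText = (v: unknown): string =>
  typeof v === "string" ? v : JSON.stringify(v) ?? "";

// Hypothetical sketch of the response fallback chain.
function extractResponse(raw: unknown): string {
  const output = field(raw, "output");
  if (isSet(output)) return asText(output); // 1
  for (const key of ["response", "model_output"]) {
    const value = field(raw, key); // 2-3
    if (isSet(value)) return asText(value);
  }
  const attribution = field(raw, "answer_attribution"); // 4
  if (Array.isArray(attribution) && attribution.length > 0) {
    const extracted = field(attribution[attribution.length - 1], "extracted_value");
    if (isSet(extracted)) return asText(extracted);
  }
  const messages = field(raw, "messages"); // 5
  if (Array.isArray(messages)) {
    // Walk backwards to reach the last assistant message first
    for (let i = messages.length - 1; i >= 0; i--) {
      if (field(messages[i], "role") !== "assistant") continue;
      const content = field(messages[i], "content");
      if (isSet(content)) return asText(content);
      break;
    }
  }
  for (const key of ["filtered_resps", "resps"]) {
    const outer = field(raw, key); // 6-7
    const inner = Array.isArray(outer) ? outer[0] : undefined;
    const value = Array.isArray(inner) ? inner[0] : undefined;
    if (isSet(value)) return asText(value);
  }
  return ""; // 8
}
```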

is_correct (boolean | undefined), first defined wins:

  1. raw.evaluation.is_correct (boolean)
  2. raw.is_correct (boolean)
  3. raw.metrics.exact_match === 1 → true; === 0 → false; else undefined
  4. (none) → undefined
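As a sketch (with hypothetical names `field` and `extractIsCorrect`), the tri-state result can be derived like this. Note that a partial exact_match score such as 0.5 maps to undefined rather than false.

```typescript
function field(obj: unknown, key: string): unknown {
  return typeof obj === "object" && obj !== null
    ? (obj as Record<string, unknown>)[key]
    : undefined;
}

// Hypothetical sketch of the is_correct fallback chain.
function extractIsCorrect(raw: unknown): boolean | undefined {
  const nested = field(field(raw, "evaluation"), "is_correct");
  if (typeof nested === "boolean") return nested; // 1
  const flat = field(raw, "is_correct");
  if (typeof flat === "boolean") return flat; // 2
  const exactMatch = field(field(raw, "metrics"), "exact_match");
  if (exactMatch === 1) return true; // 3
  if (exactMatch === 0) return false;
  return undefined; // 4
}
```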

metadata (object | undefined), merged from (in order):

  • raw.evaluation (if object)
  • raw.performance (if object)
  • raw.metadata (if object)
  • raw.metrics (if object)
  • If merged object is empty → undefined
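The merge can be sketched as below. `field` and `mergeMetadata` are hypothetical names, and the sketch assumes that on a key collision the later source overwrites the earlier one; the spec only states the merge order.

```typescript
function field(obj: unknown, key: string): unknown {
  return typeof obj === "object" && obj !== null
    ? (obj as Record<string, unknown>)[key]
    : undefined;
}

// Hypothetical sketch: shallow-merge the four sources in the listed order.
function mergeMetadata(raw: unknown): Record<string, unknown> | undefined {
  const merged: Record<string, unknown> = {};
  for (const key of ["evaluation", "performance", "metadata", "metrics"]) {
    const part = field(raw, key);
    // Assumption: later sources win when the same key appears twice
    if (typeof part === "object" && part !== null) Object.assign(merged, part);
  }
  return Object.keys(merged).length > 0 ? merged : undefined;
}
```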

sample_id (string), first non-null wins:

  1. raw.sample_id → as-is
  2. raw.doc_id → as-is
  3. raw.id → as-is
  4. (none) → String(arrayIndex) (positional fallback)

choices (any | undefined), first non-null wins:

  1. raw.choices
  2. raw.doc.choices
  3. (none) → undefined

If a row's first-pass map returns null (i.e. raw was null or non-object), it's filtered out via .filter(s => s !== null).
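The sample_id and choices chains, together with the map-then-filter pass, can be sketched as follows. `SampleStub`, `toStub`, and `normalize` are hypothetical names. The sketch coerces ids to strings and assumes the positional fallback uses the index in the unfiltered array, since the index is assigned during the map pass.

```typescript
// Hypothetical reduced output type; the real SampleResult has more fields.
interface SampleStub {
  sample_id: string;
  choices: unknown;
}

function toStub(raw: unknown, index: number): SampleStub | null {
  // null and non-object rows are dropped later by the filter
  if (typeof raw !== "object" || raw === null) return null;
  const rec = raw as Record<string, unknown>;
  const id = rec["sample_id"] ?? rec["doc_id"] ?? rec["id"] ?? index;
  const doc = rec["doc"];
  const docChoices =
    typeof doc === "object" && doc !== null
      ? (doc as Record<string, unknown>)["choices"]
      : undefined;
  return { sample_id: String(id), choices: rec["choices"] ?? docChoices ?? undefined };
}

function normalize(examples: unknown[]): SampleStub[] {
  // The type-guard filter removes nulls and narrows the element type
  return examples.map(toStub).filter((s): s is SampleStub => s !== null);
}
```

With input `[{ sample_id: "a" }, null, { doc_id: 7 }, {}, "str"]`, the surviving ids are "a", "7", and "3": the empty object keeps its original position as its id even though the null row before it was removed.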

fetchInstanceLevelData (lib/hf-data.ts:890-917), currently orphaned

Takes (url, limit?). Fetches the URL via fetch(). Splits the response text on newlines and filters out empty lines. For each line up to limit (or all lines if no limit is given), tries JSON.parse and skips malformed lines. Wraps the parsed array as { instance_examples: parsed } and passes it to parseInstanceLevelData. Returns SampleResult[], or [] on any error.

Active call sites: zero. Verified by grep across app/, components/, scripts/, other lib/ files. Only mention is its own declaration. Git log shows one commit ("Refresh eval cards UI and backend data flow") in its history.

The function is intended for a "load more / show all samples" UI feature: the pipeline ships a source_url per row pointing to the full JSONL (typically 50 samples), but the inline instance_examples preview carries at most 5. The fetcher would let the UI request the full set on user demand. This UI feature has never shipped.

Classification

  • Unconditional normalization. The function always runs on whatever shape is provided; never gates on a pre-existing canonical field. Pipeline-side fix: emit a canonical per-sample shape so the multi-field fallback chains become unnecessary.
  • Cleaning → pipeline. Pure value transform per sample. No aggregation, no reshape, no cross-record operations. Migration target: pipeline emits canonical per-sample shape on both the inline preview AND the URL JSONL files; TS parser shrinks to direct field reads or deletes entirely.
  • NOT reshape. Per the architecture choice in the framing note above, samples stay as on-demand fetches via source_url; no parquet instance_samples table, no SQL queries over samples. (If a future product feature wants cross-model sample search/filter/comparison, that's a separate reshape spec.)

Inputs and expected outputs

Group A: Pipeline-canonical shape (the only shape that fires in production today)

Input shape (from cache result.instance_level_data.instance_examples[i]):

{
  "schema_version": "...",
  "evaluation_id": "...",
  "model_id": "...",
  "evaluation_name": "...",
  "sample_id": "...",
  "sample_hash": "...",            // sometimes present
  "interaction_type": "multi_turn",
  "input": { "raw": "..." },        // ALWAYS object with .raw in production
  "output": "..." | { ... },        // sometimes
  "messages": [{ role: "...", content: "..." }, ...],  // typically present
  "answer_attribution": [..., { "extracted_value": "..." }],  // typically present
  "evaluation": { "is_correct": true|false, ... },  // ALWAYS present
  "performance": { ... },
  "metadata": { ... },
  "token_usage": { ... },
  "error": null | "...",
  "hierarchy": [...]
}

Expected output (SampleResult):

| Output field | Source path that fires | Notes |
| --- | --- | --- |
| sample_id | raw.sample_id (100% of production) | always present |
| input | raw.input.raw (100%) | branch #2 in the chain |
| ground_truth | raw.input.reference (100%) | branch #1 |
| response | raw.answer_attribution (97.31%) OR raw.messages (2.49%) OR raw.output (0.20%) | branches #4, #5, #1 |
| is_correct | raw.evaluation.is_correct (100%) | branch #1 |
| choices | undefined (100%) | no branch fires; field is unset in production |
| metadata | merged from raw.evaluation, raw.performance, raw.metadata, raw.metrics (always at least 2 of 4 present) | merged object |

Group B: Defensive fallback branches (zero firing rate in current production)

| Branch | Output field | Production hits | Origin (presumed) |
| --- | --- | --- | --- |
| raw.input (string) | input | 0 | older harness shapes |
| raw.prompt | input | 0 | lm-eval-harness |
| raw.question | input | 0 | other harnesses |
| raw.doc.question | input | 0 | HELM-style |
| raw.doc (JSON.stringify) | input | 0 | last-resort |
| raw.ground_truth | ground_truth | 0 | older shapes |
| raw.target | ground_truth | 0 | classification benchmarks |
| raw.gold | ground_truth | 0 | older lm-eval |
| raw.doc.answer | ground_truth | 0 | HELM-style |
| raw.response | response | 0 | older shapes |
| raw.model_output | response | 0 | older shapes |
| raw.filtered_resps[0][0] | response | 0 | lm-eval-harness format |
| raw.resps[0][0] | response | 0 | lm-eval-harness format |
| raw.is_correct | is_correct | 0 | flat shape |
| raw.metrics.exact_match | is_correct | 0 | metric-based correctness |
| raw.doc_id | sample_id | 0 | HELM-style |
| raw.id | sample_id | 0 | generic |
| index fallback | sample_id | 0 | last-resort |
| raw.choices | choices | 0 | multiple-choice |
| raw.doc.choices | choices | 0 | HELM multiple-choice |

These branches exist for shapes the pipeline currently does not emit. Preserve verbatim until pipeline-side guarantees the canonical shape across all data sources.

Group C: fetchInstanceLevelData JSONL parsing edge cases

| Input | Behavior |
| --- | --- |
| URL returns !res.ok (404, 500, etc.) | returns [] (no throw) |
| URL throws (network error) | logs warning to console, returns [] |
| Empty body | splits to [], returns [] |
| Body with empty lines | .filter(line => line.trim()) strips them |
| Body with malformed JSON line | swallowed in inner try-catch; line skipped, processing continues |
| limit=0 or limit=undefined | parses ALL lines |
| limit > lines.length | parses all lines (capped via Math.min) |
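The line-parsing edge cases above can be sketched as a pure function, kept separate from the network call so it can be exercised without a server. `parseJsonlLines` is a hypothetical name, not the structure of the real fetchInstanceLevelData. The sketch applies the limit to non-empty lines before parsing, so malformed lines still count toward the limit.

```typescript
// Hypothetical pure helper covering the body-parsing rows of the table.
function parseJsonlLines(text: string, limit?: number): unknown[] {
  // Drop empty and whitespace-only lines
  const lines = text.split("\n").filter((line) => line.trim() !== "");
  // limit of 0 or undefined means "all lines"; larger limits are capped
  const count = limit ? Math.min(limit, lines.length) : lines.length;
  const parsed: unknown[] = [];
  for (const line of lines.slice(0, count)) {
    try {
      parsed.push(JSON.parse(line));
    } catch {
      // Malformed line: skip it and keep going
    }
  }
  return parsed;
}
```

A caller would then wrap the result as `{ instance_examples: parsed }` before handing it to the parser, as described for fetchInstanceLevelData.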

Current TS implementation

| Concern | Location | Notes |
| --- | --- | --- |
| parseInstanceLevelData | lib/hf-data.ts:933-1043 | The parser; ~110 lines |
| fetchInstanceLevelData | lib/hf-data.ts:890-917 | URL fetcher; orphaned (zero callers) |
| Active call site | lib/hf-data.ts:1273 (inside flattenHierarchyNode) | parseInstanceLevelData(result.instance_level_data) → inlineSamples |
| Internal call site | lib/hf-data.ts:912 | inside fetchInstanceLevelData itself, which calls the parser |
| SampleResult type | lib/benchmark-schema.ts:135 | Output shape definition |
| Output field | BenchmarkEvaluation.detailed_evaluation_results_per_samples | lib/benchmark-schema.ts:33 |

Caller chain for parseInstanceLevelData

getModelSummaryById (lib/model-data.ts:1490+) → flattenModelEvaluations → flattenHierarchyNode (lib/hf-data.ts:1273) → parseInstanceLevelData(result.instance_level_data) → set as inlineSamples on each variant bucket → propagated to BenchmarkEvaluation.detailed_evaluation_results_per_samples.

UI consumers (read data.detailed_evaluation_results_per_samples):

  • components/benchmark-detail.tsx:3869: random sample for preview block
  • components/benchmark-detail.tsx:4174-4216: sample preview UI in benchmark detail
  • components/benchmark-detail.tsx:4982: variant-level sample availability check
  • components/benchmark-detail.tsx:5284-5286: variant sample picker
  • components/benchmark-detail.tsx:5569-5608: sample preview list with INSTANCE_PREVIEW_LIMIT and "see all" expansion

Caller chain for fetchInstanceLevelData

None. Function is exported and unreached. Preserved with the intent that a future "show all samples" UI consumer wires up to it.

Pipeline status

Side-by-side comparison

| Aspect | TS (this spec) | Pipeline today | Result for users |
| --- | --- | --- | --- |
| Inline preview shape | parser handles many variants | emits ONE canonical shape (input.raw, evaluation.is_correct, etc.) | parser's fallback branches almost all dead |
| URL JSONL shape | same parser handles | emits IDENTICAL canonical shape (verified by sampling one URL on 2026-04-28) | parser would work the same on URL data |
| Inline preview size | parser doesn't care | at most 5 samples per instance_examples array (3,532 examples across 712 rows) | UI capped at 5 today |
| Total samples per row | n/a | typically 50 (per instance_count field), one outlier 18 | only 10% accessible to UI today |
| URL-fetch use case | fetchInstanceLevelData exists | source_url always emitted | dead code on TS side; no UI consumer |

Concrete worked example with quantified scope

Audited 2026-04-28 against .cache/hf-data/. Verified by scripts/verify-instance-level-data.mjs.

Prevalence:

  • Total model files: 5,830
  • Files with any instance_level_data: 55 (0.94%)
  • Total (metric × model_result) rows: 86,183
  • Result rows with instance_level_data: 712 (0.83%)
  • Total inline preview examples (sum of instance_examples.length): 3,532 (always ≤5 per row)
  • Total full samples (sum of instance_count): 66,057 (full set available via source_url, not loaded today; ~19× larger than what UI currently shows)
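The percentages and the ~19× ratio follow directly from the counts listed above. A small check, using only those figures (the function names are hypothetical):

```typescript
// Percentage with two decimals, e.g. share of files that carry instance_level_data.
function percent(part: number, whole: number): string {
  return ((part / whole) * 100).toFixed(2) + "%";
}

function auditRatios(): { files: string; rows: string; perRow: string; fullVsInline: string } {
  return {
    files: percent(55, 5830),               // "0.94%"
    rows: percent(712, 86183),              // "0.83%"
    perRow: (3532 / 712).toFixed(2),        // "4.96" inline examples per row on average
    fullVsInline: (66057 / 3532).toFixed(1), // "18.7", reported above as ~19×
  };
}
```

The average of 4.96 inline examples per row is consistent with the "≤5 per row" cap: a few rows carry fewer than 5 examples.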

Shape uniformity at the instance_level_data (ild) level (712/712 rows):

  • Top-level keys are always exactly {interaction_type, instance_count, source_url, instance_examples}
  • interaction_type is always "multi_turn" (no single_turn samples in cache)

Per-example branch firing rates (3,532 examples):

  • input: input.raw 100%
  • ground_truth: input.reference 100%
  • response: answer_attribution 97.31%, messages 2.49%, output 0.20%
  • is_correct: evaluation.is_correct 100%
  • sample_id: sample_id 100%
  • choices: nothing (always undefined)

Apart from the one branch that fires in each, the 6-branch input chain, 5-branch ground_truth chain, 3-branch is_correct chain, 4-branch sample_id chain, and 2-branch choices chain are defensive scaffolding for shapes the pipeline does not currently emit (counts exclude the final empty or undefined default). The 7-branch response chain has 3 active branches.

URL JSONL shape verification: sampled one source_url (anthropic__anthropic-claude-3-7-sonnet/swe_bench_verified_mini_...); first line had identical 18 keys to the inline instance_examples[0], with input.raw and evaluation.is_correct in expected paths. Pipeline emits the same canonical shape on both inline and URL paths.

Notes for pipeline implementer

  • The pipeline already emits a canonical per-sample shape consistently. No structural change needed to current emission. The migration is to make this shape an explicit guarantee, not to change what's being emitted.
  • Suggested guarantee: every instance_examples[i] (inline AND in JSONL at source_url) has at minimum {sample_id, input.raw, input.reference?, evaluation.is_correct, answer_attribution? || messages?, metadata?}.
  • Once that guarantee is documented and verified, the TS parser shrinks dramatically: extract raw.input.raw, raw.input.reference, raw.evaluation.is_correct, raw.sample_id as direct field reads. The response field still needs a 3-branch fallback over answer_attribution, messages, and output (evaluated in the parser's existing order, where output is checked first) until pipeline emits a single normalized response field.
  • Do NOT pre-ingest the URL JSONL into pipeline parquet (per the architecture choice). The runtime UI fetches source_url on demand; this is the orphaned fetchInstanceLevelData's intended use. The benefit of pre-ingestion (cross-corpus SQL queries over samples) is speculative; defer until a product feature demands it.
  • The "shape uniformity" finding (712/712 rows have identical ild-level keys; all are multi_turn) suggests the pipeline already enforces the canonical shape. Worth documenting in the pipeline contract test (tests/pipeline-contract.test.ts).
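Under the suggested guarantee, the reduced parser might look like the sketch below. The `CanonicalSample` and `SlimResult` types and the `readCanonical` name are assumptions for illustration, including the field types, which are inferred from the Group A shape rather than taken from a schema. The response fallback keeps the existing branch order (output, then answer_attribution, then messages).

```typescript
// Assumed canonical shape; field types are illustrative, not a published schema.
interface CanonicalSample {
  sample_id: string;
  input: { raw: string; reference?: string | string[] };
  evaluation: { is_correct: boolean };
  output?: unknown;
  answer_attribution?: { extracted_value?: string }[];
  messages?: { role: string; content: unknown }[];
}

interface SlimResult {
  sample_id: string;
  input: string;
  ground_truth: string | undefined;
  response: string;
  is_correct: boolean;
}

const asText = (v: unknown): string =>
  typeof v === "string" ? v : JSON.stringify(v) ?? "";

function readCanonical(s: CanonicalSample): SlimResult {
  const ref = s.input.reference;
  const attributions = s.answer_attribution ?? [];
  const extracted = attributions[attributions.length - 1]?.extracted_value;
  const lastAssistant = [...(s.messages ?? [])].reverse().find((m) => m.role === "assistant");
  // Only the response field still needs a fallback
  const response =
    s.output !== undefined && s.output !== null ? asText(s.output)
    : extracted ? extracted
    : lastAssistant ? asText(lastAssistant.content)
    : "";
  return {
    sample_id: s.sample_id,
    input: s.input.raw, // direct read
    ground_truth: ref === undefined ? undefined : Array.isArray(ref) ? ref.join(", ") : ref,
    response,
    is_correct: s.evaluation.is_correct, // direct read
  };
}
```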

Migration checklist

  • Spec written
  • Tests cover each rule branch (tests/transformations/instance-level-data.test.ts)
  • Audit script (scripts/verify-instance-level-data.mjs)
  • Filed with pipeline owner with the spec + tests + audit script as acceptance criterion
  • Pipeline contract: explicit guarantee of canonical per-sample shape (Tier A test asserting every instance_example has input.raw, sample_id, evaluation.is_correct)
  • TS deleted: parseInstanceLevelData shrinks to direct field reads (~10 lines instead of 110), or fully deleted if pipeline emits already-flat normalized records. fetchInstanceLevelData stays orphaned-but-preserved for the future "show all samples" UI feature, OR is wired up if that feature ships.
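The contract assertion named in the checklist (every instance_example has input.raw, sample_id, evaluation.is_correct) could be sketched as a reusable check. `missingCanonicalFields` is a hypothetical helper, not an existing function in tests/pipeline-contract.test.ts.

```typescript
function field(obj: unknown, key: string): unknown {
  return typeof obj === "object" && obj !== null
    ? (obj as Record<string, unknown>)[key]
    : undefined;
}

// Hypothetical check: returns the canonical paths that are absent from one example.
function missingCanonicalFields(example: unknown): string[] {
  const missing: string[] = [];
  const sampleId = field(example, "sample_id");
  if (sampleId === undefined || sampleId === null) missing.push("sample_id");
  const inputRaw = field(field(example, "input"), "raw");
  if (inputRaw === undefined || inputRaw === null) missing.push("input.raw");
  // is_correct must be a real boolean, not merely present
  if (typeof field(field(example, "evaluation"), "is_correct") !== "boolean") {
    missing.push("evaluation.is_correct");
  }
  return missing;
}
```

A contract test would run this over every inline example (and a sample of URL JSONL lines) and fail if any call returns a non-empty list.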

Future product decisions (deferred)

  • "Show all samples" UI feature: would un-orphan fetchInstanceLevelData and let users see the full 50-sample set instead of just the 5-sample preview. Lazy fetch on user click. This is the concrete capability the URL-fetch architecture supports; the spec assumes it's a product roadmap item, not committed scope.
  • Cross-model sample querying / search / filter: would require instance_samples.parquet and SQL queries (the alternative architecture I initially proposed). Out of scope for this spec; revisit if/when product asks.
  • Single-turn samples: pipeline currently only emits interaction_type: multi_turn. If pipeline starts emitting single_turn shapes that exercise dormant parser branches (e.g. flat input strings, prompt/question fields), the spec's "100% canonical shape" claim breaks and the parser fallbacks become live again.