general-eval-card / notes /transformations /12-instance-level-data.md

Jenny Chim

Deploy DuckDB-backed frontend to

da8db3e 24 days ago


Per-instance JSONL normalization

Drafted 2026-04-28. Migration item #7 in notes/migration-plan.md.

Framing reminder

We are refactoring for UI efficiency. TS-as-is is the canonical spec for behaviour. This spec covers a per-record value transform that extracts UI-friendly strings (input/response/correctness/etc.) from rich per-sample objects emitted by the pipeline. An audit on 2026-04-28 against the full corpus found that most of the parser's many fallback branches never fire, because the pipeline emits a single canonical shape; the parser preserves them as defensive scaffolding for older or different harness output formats.

Architecture choice (cleaning-only, no SQL involvement): the inline instance_examples preview (≤5 samples per result) gets normalized in pipeline; the per-result source_url pointing to a full JSONL dump (typically 50 samples) stays as an on-demand UI fetch, NOT pre-ingested into a parquet table. This avoids speculative pipeline ingestion work and matches the actual product need (lazy-load all samples on user click, not cross-corpus sample querying). The orphaned fetchInstanceLevelData (lib/hf-data.ts:890-917) gets re-wired to a future "show all samples" UI feature rather than deleted.

Rule (as TS implements it today)

`parseInstanceLevelData` (`lib/hf-data.ts:933-1043`)

Pure function: takes a JSON object, walks instance_examples[], returns SampleResult[].

Top-level guard:

If data is null/non-object → return []
If data.instance_examples is array → use it
Else if data itself is array → use it
Else → []
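
The guard above can be sketched as a small standalone function. This is a minimal illustration, not the code in lib/hf-data.ts; the name selectExamples is hypothetical.

```typescript
// Hypothetical helper; mirrors the four guard rules listed above.
function selectExamples(data: unknown): unknown[] {
  // Rule 1: null or non-object input yields no examples.
  if (data === null || typeof data !== "object") return [];
  // Rule 2: prefer the wrapped form { instance_examples: [...] }.
  const wrapped = (data as { instance_examples?: unknown }).instance_examples;
  if (Array.isArray(wrapped)) return wrapped;
  // Rule 3: accept a bare array of examples.
  if (Array.isArray(data)) return data;
  // Rule 4: anything else is treated as empty.
  return [];
}
```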

Per-example field extraction (each is a fallback chain):

input (string) — first non-empty wins:

raw.input is string → use as-is
raw.input.raw is set → String(raw.input.raw)
raw.prompt → use as-is
raw.question → use as-is
raw.doc.question → use as-is
raw.doc exists → JSON.stringify(raw.doc).slice(0, 500)
(none) → empty string
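
A minimal sketch of the input chain, assuming that "non-empty" means an empty string falls through to the next branch. The names Rec, isRecord, nonEmptyString and extractInput are hypothetical and only illustrate the order of the fallbacks.

```typescript
type Rec = Record<string, unknown>;

const isRecord = (v: unknown): v is Rec =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Assumption: an empty string counts as "no value" and falls through.
const nonEmptyString = (v: unknown): string | undefined =>
  typeof v === "string" && v !== "" ? v : undefined;

function extractInput(raw: Rec): string {
  const input = raw.input;
  const docValue = raw.doc;
  const doc = isRecord(docValue) ? docValue : undefined;
  return (
    nonEmptyString(input) ??                                                // raw.input as string
    (isRecord(input) && input.raw != null ? String(input.raw) : undefined) ?? // raw.input.raw
    nonEmptyString(raw.prompt) ??
    nonEmptyString(raw.question) ??
    nonEmptyString(doc?.question) ??
    (doc ? JSON.stringify(doc).slice(0, 500) : undefined) ??                // last resort
    ""
  );
}
```

In production only the second branch fires, so a call such as extractInput({ input: { raw: "What is 2+2?" } }) returns the raw prompt text unchanged.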

ground_truth (string | undefined) — first non-null wins:

raw.input.reference → array.join(", ") if it is an array; otherwise String(...)
raw.ground_truth → String(...)
raw.target → String(...)
raw.gold → String(...)
raw.doc.answer → String(...)
(none) → undefined
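
The ground_truth chain could look roughly like the sketch below. The helper names are hypothetical; the only special case taken from the rule list is that an array under input.reference is joined with ", ".

```typescript
type Rec = Record<string, unknown>;

const isRecord = (v: unknown): v is Rec =>
  typeof v === "object" && v !== null && !Array.isArray(v);

function extractGroundTruth(raw: Rec): string | undefined {
  const input = raw.input;
  const reference = isRecord(input) ? input.reference : undefined;
  // Branch 1: input.reference, joined when it is an array.
  if (reference !== null && reference !== undefined) {
    return Array.isArray(reference) ? reference.join(", ") : String(reference);
  }
  // Branches 2-5: first non-null value, coerced with String(...).
  const doc = raw.doc;
  const fallbacks: unknown[] = [
    raw.ground_truth,
    raw.target,
    raw.gold,
    isRecord(doc) ? doc.answer : undefined,
  ];
  for (let i = 0; i < fallbacks.length; i++) {
    const value = fallbacks[i];
    if (value !== null && value !== undefined) return String(value);
  }
  return undefined;
}
```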

response (string) — first non-empty wins:

raw.output is set → string-as-is or JSON.stringify(...)
raw.response → use as-is
raw.model_output → use as-is
raw.answer_attribution is non-empty array → take last element's extracted_value (or empty string)
raw.messages is non-empty array → scan from the end for the last assistant message → its content as a string (stringified if it is not already a string)
raw.filtered_resps[0][0] → use
raw.resps[0][0] → use
(none) → empty string
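
A sketch of the response chain under one stated assumption: a branch that produces an empty string (for example an attribution entry without extracted_value) falls through to the next branch, in line with "first non-empty wins". All function names here are hypothetical.

```typescript
type Rec = Record<string, unknown>;

const isRecord = (v: unknown): v is Rec =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Strings pass through; null/undefined become ""; anything else is JSON-stringified.
function asText(value: unknown): string {
  if (value === null || value === undefined) return "";
  return typeof value === "string" ? value : JSON.stringify(value);
}

function lastExtractedValue(attribution: unknown): string {
  if (!Array.isArray(attribution) || attribution.length === 0) return "";
  const last: unknown = attribution[attribution.length - 1];
  return isRecord(last) ? asText(last.extracted_value) : "";
}

function lastAssistantContent(messages: unknown): string {
  if (!Array.isArray(messages)) return "";
  // Walk backwards so the most recent assistant turn is found first.
  for (let i = messages.length - 1; i >= 0; i--) {
    const message: unknown = messages[i];
    if (isRecord(message) && message.role === "assistant") return asText(message.content);
  }
  return "";
}

function firstNestedResponse(resps: unknown): string {
  if (!Array.isArray(resps) || !Array.isArray(resps[0])) return "";
  return asText(resps[0][0]);
}

function extractResponse(raw: Rec): string {
  // Candidates in chain order; assumption: the first non-empty string wins.
  const candidates: string[] = [
    asText(raw.output),
    asText(raw.response),
    asText(raw.model_output),
    lastExtractedValue(raw.answer_attribution),
    lastAssistantContent(raw.messages),
    firstNestedResponse(raw.filtered_resps),
    firstNestedResponse(raw.resps),
  ];
  for (let i = 0; i < candidates.length; i++) {
    if (candidates[i] !== "") return candidates[i];
  }
  return "";
}
```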

is_correct (boolean | undefined) — first defined wins:

raw.evaluation.is_correct (boolean)
raw.is_correct (boolean)
raw.metrics.exact_match === 1 → true; === 0 → false; else undefined
(none) → undefined
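
The three is_correct branches reduce to a short function. The sketch below uses hypothetical names and reads the rule list literally: only a boolean counts for the first two branches, and only exact 0 or 1 counts for exact_match.

```typescript
type Rec = Record<string, unknown>;

const isRecord = (v: unknown): v is Rec =>
  typeof v === "object" && v !== null && !Array.isArray(v);

function extractIsCorrect(raw: Rec): boolean | undefined {
  // Branch 1: nested evaluation.is_correct (the only branch firing in production).
  const evaluation = raw.evaluation;
  const nested = isRecord(evaluation) ? evaluation.is_correct : undefined;
  if (typeof nested === "boolean") return nested;
  // Branch 2: flat is_correct.
  const flat = raw.is_correct;
  if (typeof flat === "boolean") return flat;
  // Branch 3: exact_match metric, only when it is exactly 1 or 0.
  const metrics = raw.metrics;
  const exactMatch = isRecord(metrics) ? metrics.exact_match : undefined;
  if (exactMatch === 1) return true;
  if (exactMatch === 0) return false;
  return undefined;
}
```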

metadata (object | undefined) — merged from (in order):

raw.evaluation (if object)
raw.performance (if object)
raw.metadata (if object)
raw.metrics (if object)
If merged object is empty → undefined
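
A sketch of the metadata merge. The rule list gives the merge order but not the collision behaviour, so the sketch assumes that later sources overwrite earlier ones on duplicate keys (the usual result of object spread). The name mergeMetadata is hypothetical.

```typescript
type Rec = Record<string, unknown>;

const isRecord = (v: unknown): v is Rec =>
  typeof v === "object" && v !== null && !Array.isArray(v);

function mergeMetadata(raw: Rec): Rec | undefined {
  // Merge order as listed: evaluation, performance, metadata, metrics.
  const sources: unknown[] = [raw.evaluation, raw.performance, raw.metadata, raw.metrics];
  let merged: Rec = {};
  for (let i = 0; i < sources.length; i++) {
    const source = sources[i];
    // Assumption: on key collisions the later source wins.
    if (isRecord(source)) merged = { ...merged, ...source };
  }
  return Object.keys(merged).length === 0 ? undefined : merged;
}
```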

sample_id (string) — first non-null wins:

raw.sample_id → as-is
raw.doc_id → as-is
raw.id → as-is
(none) → String(arrayIndex) (positional fallback)

choices (any | undefined) — first non-null wins:

raw.choices
raw.doc.choices
(none) → undefined

If a row's first-pass map returns null (i.e. raw was null or non-object), it's filtered out via .filter(s => s !== null).
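
The sample_id and choices chains, together with the null-row filter, can be sketched as below. Two assumptions are made: the positional fallback uses the index in the original array (before null rows are filtered out), and IDs are coerced with String(...) so the field is always a string. RowSketch and parseRows are hypothetical names, and the sketch omits every other output field.

```typescript
type Rec = Record<string, unknown>;

// Non-null object check, matching "raw was null or non-object".
const isObject = (v: unknown): v is Rec => typeof v === "object" && v !== null;

interface RowSketch {
  sample_id: string;
  choices?: unknown;
}

function parseRows(examples: unknown[]): RowSketch[] {
  const mapped = examples.map((raw, index): RowSketch | null => {
    if (!isObject(raw)) return null; // dropped by the filter below
    const doc = raw.doc;
    // sample_id chain: sample_id, doc_id, id, then the positional index.
    const id = raw.sample_id ?? raw.doc_id ?? raw.id ?? index;
    // choices chain: raw.choices, then raw.doc.choices.
    const choices = raw.choices ?? (isObject(doc) ? doc.choices : undefined);
    return { sample_id: String(id), choices: choices ?? undefined };
  });
  return mapped.filter((row): row is RowSketch => row !== null);
}
```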

`fetchInstanceLevelData` (`lib/hf-data.ts:890-917`) — currently orphaned

Takes (url, limit?). Fetches the URL via fetch(). Splits the response text on newlines and drops empty lines. For each line up to limit (or all lines if no limit is given), tries JSON.parse and skips malformed lines. Wraps the parsed array as { instance_examples: parsed } and passes it to parseInstanceLevelData. Returns SampleResult[], or [] on any error.
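
The line-handling part of that description, without the network call, might look like the sketch below. It assumes the limit is applied to the non-empty lines before parsing, so a malformed line inside the limit still uses up one slot. The name parseJsonlLines is hypothetical.

```typescript
function parseJsonlLines(text: string, limit?: number): unknown[] {
  // Drop empty and whitespace-only lines.
  const lines = text.split("\n").filter((line) => line.trim());
  // A falsy limit (0 or undefined) means "parse everything".
  const count = limit ? Math.min(limit, lines.length) : lines.length;
  const parsed: unknown[] = [];
  for (let i = 0; i < count; i++) {
    try {
      parsed.push(JSON.parse(lines[i]));
    } catch (_err) {
      // Malformed line: skip it and keep going.
    }
  }
  return parsed;
}
```

The result would then be wrapped as { instance_examples: parsed } before being handed to the parser, as described above.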

Active call sites: zero. Verified by grep across app/, components/, scripts/, other lib/ files. Only mention is its own declaration. Git log shows one commit ("Refresh eval cards UI and backend data flow") in its history.

The function is intended for a "load more / show all samples" UI feature: the pipeline ships a source_url per row pointing to the full JSONL (typically 50 samples), but the inline instance_examples preview carries at most 5. The fetcher would let the UI request the full set on user demand. This UI feature has never shipped.

Classification

Unconditional normalization. The function always runs on whatever shape is provided; never gates on a pre-existing canonical field. Pipeline-side fix: emit a canonical per-sample shape so the multi-field fallback chains become unnecessary.
Cleaning → pipeline. Pure value transform per sample. No aggregation, no reshape, no cross-record operations. Migration target: pipeline emits canonical per-sample shape on both the inline preview AND the URL JSONL files; TS parser shrinks to direct field reads or deletes entirely.
NOT reshape. Per the architecture choice in the framing note above, samples stay as on-demand fetches via source_url; no parquet instance_samples table, no SQL queries over samples. (If a future product feature wants cross-model sample search/filter/comparison, that's a separate reshape spec.)

Inputs and expected outputs

Group A — Pipeline-canonical shape (the only shape that fires in production today)

Input shape (from cache result.instance_level_data.instance_examples[i]):

{
  "schema_version": "...",
  "evaluation_id": "...",
  "model_id": "...",
  "evaluation_name": "...",
  "sample_id": "...",
  "sample_hash": "...",            // sometimes present
  "interaction_type": "multi_turn",
  "input": { "raw": "..." },        // ALWAYS object with .raw in production
  "output": "..." | { ... },        // sometimes
  "messages": [{ "role": "...", "content": "..." }, ...],  // typically present
  "answer_attribution": [..., { "extracted_value": "..." }],  // typically present
  "evaluation": { "is_correct": true|false, ... },  // ALWAYS present
  "performance": { ... },
  "metadata": { ... },
  "token_usage": { ... },
  "error": null | "...",
  "hierarchy": [...]
}

Expected output (SampleResult):

| Output field | Source path that fires | Notes |
| --- | --- | --- |
| `sample_id` | `raw.sample_id` (100% of production) | always present |
| `input` | `raw.input.raw` (100%) | branch #2 in the chain |
| `ground_truth` | `raw.input.reference` (100%) | branch #1 |
| `response` | `raw.answer_attribution` (97.31%) OR `raw.messages` (2.49%) OR `raw.output` (0.20%) | branches #4, #5, #1 |
| `is_correct` | `raw.evaluation.is_correct` (100%) | branch #1 |
| `choices` | `undefined` (100%) | no branch fires; field is unset in production |
| `metadata` | merged from `raw.evaluation`, `raw.performance`, `raw.metadata`, `raw.metrics` (always at least 2 of 4 present) | merged object |

Group B — Defensive fallback branches (zero firing rate in current production)

| Branch | Output field | Origin (presumed) |
| --- | --- | --- |
| `raw.input` (string) | input | older harness shapes |
| `raw.prompt` | input | lm-eval-harness |
| `raw.question` | input | other harnesses |
| `raw.doc.question` | input | HELM-style |
| `raw.doc` (JSON.stringify) | input | last-resort |
| `raw.ground_truth` | ground_truth | older shapes |
| `raw.target` | ground_truth | classification benchmarks |
| `raw.gold` | ground_truth | older lm-eval |
| `raw.doc.answer` | ground_truth | HELM-style |
| `raw.response` | response | older shapes |
| `raw.model_output` | response | older shapes |
| `raw.filtered_resps[0][0]` | response | lm-eval-harness format |
| `raw.resps[0][0]` | response | lm-eval-harness format |
| `raw.is_correct` | is_correct | flat shape |
| `raw.metrics.exact_match` | is_correct | metric-based correctness |
| `raw.doc_id` | sample_id | HELM-style |
| `raw.id` | sample_id | generic |
| index fallback | sample_id | last-resort |
| `raw.choices` | choices | multiple-choice |
| `raw.doc.choices` | choices | HELM multiple-choice |

These branches exist for shapes the pipeline currently does not emit. Preserve verbatim until pipeline-side guarantees the canonical shape across all data sources.

Group C — `fetchInstanceLevelData` JSONL parsing edge cases

| Input | Behavior |
| --- | --- |
| URL returns `!res.ok` (404, 500, etc.) | returns `[]` (no throw) |
| URL throws (network error) | logs warning to console, returns `[]` |
| Empty body | splits to `[]`, returns `[]` |
| Body with empty lines | `.filter(line => line.trim())` strips them |
| Body with malformed JSON line | swallowed in inner try-catch; line skipped, processing continues |
| `limit=0` or `limit=undefined` | parses ALL lines |
| `limit > lines.length` | parses all lines (capped via `Math.min`) |

Current TS implementation

| Concern | Location | Notes |
| --- | --- | --- |
| `parseInstanceLevelData` | `lib/hf-data.ts:933-1043` | The parser; ~110 lines |
| `fetchInstanceLevelData` | `lib/hf-data.ts:890-917` | URL fetcher; orphaned (zero callers) |
| Active call site | `lib/hf-data.ts:1273` (inside `flattenHierarchyNode`) | `parseInstanceLevelData(result.instance_level_data)` → `inlineSamples` |
| Internal call site | `lib/hf-data.ts:912` | inside `fetchInstanceLevelData` itself; hands the wrapped lines to the parser (a plain call, not recursion) |
| `SampleResult` type | `lib/benchmark-schema.ts:135` | Output shape definition |
| Output field | `BenchmarkEvaluation.detailed_evaluation_results_per_samples` | `lib/benchmark-schema.ts:33` |

Caller chain for `parseInstanceLevelData`

getModelSummaryById (lib/model-data.ts:1490+) → flattenModelEvaluations → flattenHierarchyNode (lib/hf-data.ts:1273) → parseInstanceLevelData(result.instance_level_data) → set as inlineSamples on each variant bucket → propagated to BenchmarkEvaluation.detailed_evaluation_results_per_samples.

UI consumers (read data.detailed_evaluation_results_per_samples):

components/benchmark-detail.tsx:3869 — random sample for preview block
components/benchmark-detail.tsx:4174-4216 — sample preview UI in benchmark detail
components/benchmark-detail.tsx:4982 — variant-level sample availability check
components/benchmark-detail.tsx:5284-5286 — variant sample picker
components/benchmark-detail.tsx:5569-5608 — sample preview list with INSTANCE_PREVIEW_LIMIT and "see all" expansion

Caller chain for `fetchInstanceLevelData`

None. Function is exported and unreached. Preserved with the intent that a future "show all samples" UI consumer wires up to it.

Pipeline status

Side-by-side comparison

| Aspect | TS (this spec) | Pipeline today | Result for users |
| --- | --- | --- | --- |
| Inline preview shape | parser handles many variants | emits ONE canonical shape (`input.raw`, `evaluation.is_correct`, etc.) | parser's fallback branches almost all dead |
| URL JSONL shape | same parser handles | emits IDENTICAL canonical shape (verified by sampling one URL on 2026-04-28) | parser would work the same on URL data |
| Inline preview size | parser doesn't care | at most 5 samples per `instance_examples` array (3,532 examples across 712 rows) | UI capped at 5 today |
| Total samples per row | n/a | typically 50 (per `instance_count` field), one outlier 18 | only ~10% of a typical row (5 of 50) accessible to UI today |
| URL-fetch use case | `fetchInstanceLevelData` exists | `source_url` always emitted | dead code on TS side; no UI consumer |

Concrete worked example with quantified scope

Audited 2026-04-28 against .cache/hf-data/. Verified by scripts/verify-instance-level-data.mjs.

Prevalence:

Total model files: 5,830
Files with any instance_level_data: 55 (0.94%)
Total (metric × model_result) rows: 86,183
Result rows with instance_level_data: 712 (0.83%)
Total inline preview examples (sum of instance_examples.length): 3,532 (always ≤5 per row)
Total full samples (sum of instance_count): 66,057 (full set available via source_url, not loaded today; ~19× larger than what UI currently shows)
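
The percentages and the "~19×" ratio follow directly from the counts above. The short sketch below redoes the arithmetic; the variable names are illustrative only.

```typescript
// Counts copied from the prevalence list above.
const audit = {
  modelFiles: 5830,
  filesWithIld: 55,
  resultRows: 86183,
  rowsWithIld: 712,
  inlineExamples: 3532,
  fullSamples: 66057,
};

const percent = (part: number, whole: number): string =>
  ((100 * part) / whole).toFixed(2) + "%";

const fileShare = percent(audit.filesWithIld, audit.modelFiles);          // "0.94%"
const rowShare = percent(audit.rowsWithIld, audit.resultRows);            // "0.83%"
const fullToInline = (audit.fullSamples / audit.inlineExamples).toFixed(1); // "18.7", i.e. ~19x
const inlinePerRow = (audit.inlineExamples / audit.rowsWithIld).toFixed(2); // "4.96", just under 5
```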

Shape uniformity at the instance_level_data ("ild") level (712/712 rows):

Top-level keys are always exactly {interaction_type, instance_count, source_url, instance_examples}
interaction_type is always "multi_turn" (no single_turn samples in cache)

Per-example branch firing rates (3,532 examples):

input: input.raw 100%
ground_truth: input.reference 100%
response: answer_attribution 97.31%, messages 2.49%, output 0.20%
is_correct: evaluation.is_correct 100%
sample_id: sample_id 100%
choices: nothing (always undefined)
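
Firing rates like these can be obtained by classifying each example by the first response branch that applies and counting the results. The sketch below is a hypothetical illustration of that idea, not the contents of scripts/verify-instance-level-data.mjs; it only distinguishes the three branches seen in production and lumps everything else into "other".

```typescript
type Rec = Record<string, unknown>;
type Branch = "output" | "answer_attribution" | "messages" | "other";

const nonEmptyArray = (v: unknown): boolean => Array.isArray(v) && v.length > 0;

// Simplified classifier: follows the chain order but ignores empty extracted values.
function responseBranch(raw: Rec): Branch {
  if (raw.output !== null && raw.output !== undefined) return "output";
  if (typeof raw.response === "string" && raw.response !== "") return "other";
  if (typeof raw.model_output === "string" && raw.model_output !== "") return "other";
  if (nonEmptyArray(raw.answer_attribution)) return "answer_attribution";
  if (nonEmptyArray(raw.messages)) return "messages";
  return "other";
}

function tallyBranches(examples: Rec[]): Record<Branch, number> {
  const counts: Record<Branch, number> = {
    output: 0,
    answer_attribution: 0,
    messages: 0,
    other: 0,
  };
  for (let i = 0; i < examples.length; i++) {
    counts[responseBranch(examples[i])] += 1;
  }
  return counts;
}
```

Dividing each count by the number of examples gives the per-branch percentage.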

Not counting the final default, the input chain has 6 branches, ground_truth 5, is_correct 3, sample_id 3 plus the positional fallback, and choices 2. In each of these chains at most one branch fires in production (none for choices); the rest are defensive scaffolding for shapes the pipeline does not currently emit. The 7-branch response chain has 3 active branches.

URL JSONL shape verification: sampled one source_url (anthropic__anthropic-claude-3-7-sonnet/swe_bench_verified_mini_...); first line had identical 18 keys to the inline instance_examples[0], with input.raw and evaluation.is_correct in expected paths. Pipeline emits the same canonical shape on both inline and URL paths.

Notes for pipeline implementer

The pipeline already emits a canonical per-sample shape consistently. No structural change needed to current emission. The migration is to make this shape an explicit guarantee, not to change what's being emitted.
Suggested guarantee: every instance_examples[i] (inline AND in JSONL at source_url) has at minimum {sample_id, input.raw, input.reference?, evaluation.is_correct, answer_attribution? || messages?, metadata?}.
Once that guarantee is documented and verified, the TS parser shrinks dramatically: extract raw.input.raw, raw.input.reference, raw.evaluation.is_correct, raw.sample_id as direct field reads. The response field still needs a 3-branch fallback over output, answer_attribution and messages (that is their precedence in today's chain; by firing rate the order is answer_attribution, messages, output) until pipeline emits a single normalized response field.
Do NOT pre-ingest the URL JSONL into pipeline parquet (per the architecture choice). The runtime UI fetches source_url on demand; this is the orphaned fetchInstanceLevelData's intended use. The benefit of pre-ingestion (cross-corpus SQL queries over samples) is speculative; defer until a product feature demands it.
The "shape uniformity" finding (712/712 rows have identical ild-level keys; all are multi_turn) suggests the pipeline already enforces the canonical shape. Worth documenting in the pipeline contract test (tests/pipeline-contract.test.ts).
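
To make the "shrinks dramatically" point concrete, here is a minimal sketch of what the parser could reduce to once the canonical shape is guaranteed. The interfaces are assumptions based on the suggested guarantee, not the real SampleResult type in lib/benchmark-schema.ts, and the response fallback keeps the precedence of the current chain (output, then answer_attribution, then messages).

```typescript
// Hypothetical types derived from the suggested guarantee above.
interface CanonicalExample {
  sample_id: string;
  input: { raw: string; reference?: string | string[] };
  evaluation: { is_correct: boolean };
  output?: unknown;
  answer_attribution?: { extracted_value?: string }[];
  messages?: { role: string; content: string }[];
}

interface SampleSketch {
  sample_id: string;
  input: string;
  ground_truth?: string;
  response: string;
  is_correct: boolean;
}

// The only remaining fallback: output, then answer_attribution, then messages.
function responseOf(example: CanonicalExample): string {
  const output = example.output;
  if (output !== null && output !== undefined) {
    return typeof output === "string" ? output : JSON.stringify(output);
  }
  const attribution = example.answer_attribution;
  const lastValue =
    attribution && attribution.length > 0
      ? attribution[attribution.length - 1].extracted_value
      : undefined;
  if (lastValue) return lastValue;
  const messages = example.messages ?? [];
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "assistant") return messages[i].content;
  }
  return "";
}

function parseCanonical(example: CanonicalExample): SampleSketch {
  const reference = example.input.reference;
  return {
    sample_id: example.sample_id,
    input: example.input.raw,
    ground_truth:
      reference === undefined
        ? undefined
        : Array.isArray(reference)
          ? reference.join(", ")
          : reference,
    response: responseOf(example),
    is_correct: example.evaluation.is_correct,
  };
}
```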

Migration checklist

Spec written
Tests cover each rule branch (tests/transformations/instance-level-data.test.ts)
Audit script (scripts/verify-instance-level-data.mjs)
Filed with pipeline owner with the spec + tests + audit script as acceptance criterion
Pipeline contract: explicit guarantee of canonical per-sample shape (Tier A test asserting every instance_example has input.raw, sample_id, evaluation.is_correct)
TS deleted: parseInstanceLevelData shrinks to direct field reads (~10 lines instead of 110), or fully deleted if pipeline emits already-flat normalized records. fetchInstanceLevelData stays orphaned-but-preserved for the future "show all samples" UI feature, OR is wired up if that feature ships.
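
The contract item above could be backed by a small checker like the following sketch. It is a hypothetical helper, not the existing content of tests/pipeline-contract.test.ts, and it assumes sample_id and input.raw are strings and evaluation.is_correct is a boolean, as in the canonical shape shown earlier.

```typescript
type Rec = Record<string, unknown>;

const isRecord = (v: unknown): v is Rec =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Returns the list of required paths that are missing or have the wrong type.
function contractViolations(example: unknown): string[] {
  if (!isRecord(example)) return ["example is not an object"];
  const missing: string[] = [];
  const sampleId = example.sample_id;
  if (typeof sampleId !== "string" || sampleId === "") missing.push("sample_id");
  const input = example.input;
  if (!isRecord(input) || typeof input.raw !== "string") missing.push("input.raw");
  const evaluation = example.evaluation;
  if (!isRecord(evaluation) || typeof evaluation.is_correct !== "boolean") {
    missing.push("evaluation.is_correct");
  }
  return missing;
}
```

A contract test would then assert that contractViolations returns an empty array for every instance example, inline and in the JSONL at source_url.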

Future product decisions (deferred)

"Show all samples" UI feature — would un-orphan fetchInstanceLevelData and let users see the full 50-sample set instead of just the 5-sample preview. Lazy fetch on user click. This is the concrete capability the URL-fetch architecture supports; the spec assumes it's a product roadmap item, not committed scope.
Cross-model sample querying / search / filter — would require instance_samples.parquet and SQL queries (the alternative architecture I initially proposed). Out of scope for this spec; revisit if/when product asks.
Single-turn samples — pipeline currently only emits interaction_type: multi_turn. If pipeline starts emitting single_turn shapes that exercise dormant parser branches (e.g. flat input strings, prompt/question fields), the spec's "100% canonical shape" claim breaks and the parser fallbacks become live again.